"Fixes" for page flipping under PRIME on AMD & nouveau
Mario Kleiner
mario.kleiner.de at gmail.com
Wed Aug 17 23:51:07 UTC 2016
On 08/17/2016 07:43 PM, Alex Deucher wrote:
> On Wed, Aug 17, 2016 at 12:35 PM, Mario Kleiner
> <mario.kleiner.de at gmail.com> wrote:
>> On 08/17/2016 06:27 PM, Christian König wrote:
>>>>
>>>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>>>> scanout mode from tiled to linear on the fly during flips.
>>>
>>> Well I'm not an expert on this, but as far as I know the bigger problem
>>> is that the dedicated AMD hardware generations you are targeting usually
>>> can't reliably scan out from system memory without a rather complicated
>>> setup.
>>>
>>> So that is a complete NAK to the radeon changes.
>>
>>
>> Hi Christian,
>>
>> thanks for the feedback, but i think that's a misunderstanding. The patches
>> don't make them scanout from system memory, they just enforce a fresh copy
>> from RAM/GTT -> VRAM before scanning out a buffer again. I just assume there
>> is a more elegant/clean way than this "fake" pin/unpin to GTT to essentially
>> tell the driver that its current VRAM content is stale and needs a refresh
>> from the up to date dmabuf in system RAM.
>>
>
> I think the ddx should handle the copy rather than the kernel. That
> also takes care of the tiling. I.e., copy from the linear shared
> buffer in system memory to the tiled scanout buffer in vram. The ddx
> should also be able to take damage into account and only copy the
> delta. From a bandwidth perspective, I'm not sure how much sense
> pageflipping makes since there are so many copies already.
>
> Alex
That's what the ati-ddx/amdgpu-ddx does at the moment, as it detects the
mismatch in tiling flags and uses the DRI3/Present copy path instead of
the pageflip path. The problem is that the server's Present
implementation doesn't request a vsync'ed start of the copy operation
and the whole procedure is too slow to keep ahead of the scanout, so it
tears pretty badly for many animations. Also no page flipping = no
reliable timestamps. And the modesetting ddx doesn't handle it at all,
as it doesn't know about the tiling mismatch.
You are right, going through page flipping doesn't save any bandwidth,
and may even use more without damage handling, but it prevents tearing
and undefined presentation timing.
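
To make the tearing argument concrete, here is a toy timing model in
Python. It is only a sketch under simplifying assumptions, not how the
X-Server or any ddx actually works: scanout and copy both walk the
framebuffer top to bottom at constant speed, vblank is ignored, and the
function name tears() and the example numbers are made up for
illustration. A scanout pass counts as torn if it reads some lines
before the copy reached them and other lines after.

```python
def tears(copy_start_ms, copy_ms, scan_ms=16.0, lines=1080):
    """Return True if one scanout pass shows a mix of old and new lines.

    copy_start_ms: when the copy begins, relative to the start of scanout
    copy_ms:       how long the copy needs for the whole frame
    scan_ms:       how long the display needs to scan out the whole frame
    All parameters are illustrative assumptions, not measured values.
    """
    new_lines = 0
    for y in range(lines):
        scan_t = y * scan_ms / lines                   # scanout reads line y
        copy_t = copy_start_ms + y * copy_ms / lines   # copy writes line y
        if copy_t <= scan_t:
            new_lines += 1                             # line already updated
    # All old or all new means a consistent image; a mix means a tear.
    return 0 < new_lines < lines


print(tears(0.0, 8.0))    # copy starts with scanout and stays ahead of it
print(tears(0.0, 20.0))   # copy starts in time but is slower than scanout
print(tears(5.0, 8.0))    # fast copy, but started mid-frame (not vsync'ed)
```

In this model only the first case is tear-free: a copy that starts
mid-frame tears even when it is fast, and a copy slower than the
scanout tears even when it starts on time.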
So it sounds as if the bug is not that page flipping doesn't quite work
without my hack, but that i even managed to get this far?
There is this other approach from NVidia's Alex Goins for their
proprietary driver, whose patches landed in the X-Server 1.19 master
branch a couple of weeks ago. I haven't read his patches in detail yet,
and i so far couldn't successfully test them with the reference
implementation in modesetting ddx 1.19. Afaik there the display gpu
exports a pair of scanout friendly, page flipping compatible dmabufs (i
assume linear, contiguous, accessible by the display engines), and the
offload gpu imports those and renders into them. That saves one extra
copy, so should be somewhat more efficient.
Setting it up seems to be more involved and less flexible though. So far
i couldn't make it work here for testing. Maybe bugs, maybe mistakes on
my side, maybe i just have the wrong hardware for it. Need to read the
patches first in detail to understand how it is supposed to work.
-mario
>
>> Btw. i'll be offline for the next few hours, just wanted to get this out
>> now.
>>
>> thanks,
>> -mario
>>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 17.08.2016 um 18:12 schrieb Mario Kleiner:
>>>>
>>>> Hi,
>>>>
>>>> i spent some time playing with DRI3/Present + PRIME for testing
>>>> how well it works for Optimus/Enduro style setups wrt. page flipping
>>>> on the current kernel/mesa/xorg. I want page flipping, because
>>>> neuroscience/medical applications need the reliable timing/timestamping
>>>> and tear-free presentation we currently can only get via page
>>>> flipping, but not the copyswap path.
>>>>
>>>> Intel as display gpu + nouveau for render offload worked nicely
>>>> on intel-ddx with page flipping, proper timing, dmabuf fence sync
>>>> and all.
>>>>
>>>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>>>> scanout mode from tiled to linear on the fly during flips. That's
>>>> a todo in itself. For the moment i used the ati-ddx with Option
>>>> "ColorTiling/ColorTiling2D" "off" to force my pair of old Radeon
>>>> HD-5770's into linear mode so page flipping can be used for
>>>> prime. The current modesetting-ddx will use page flipping in
>>>> any case as it doesn't detect the tiling format mismatch.
>>>>
>>>> nouveau uses page flips.
>>>>
>>>> Turns out that prime + page flipping currently doesn't work
>>>> on nouveau and amd. The first offload rendered images from
>>>> the imported dmabufs show up properly, but then the display
>>>> is stuck alternating between the first two or three rendered
>>>> frames.
>>>>
>>>> The problem is that during the pageflip ioctl we pin the
>>>> dmabuf into VRAM in preparation for scanout, then unpin it
>>>> when we are done with it at next flip, but the buffer stays
>>>> in the VRAM memory domain. Next time we flip to the buffer
>>>> again, the driver skips the DMA copy from GTT to VRAM during
>>>> pinning, because the buffer's content apparently already resides
>>>> in VRAM. Therefore it doesn't update the VRAM copy with the updated
>>>> dmabuf content in system RAM, so freshly rendered frames from the
>>>> prime export/render offload gpu never reach the display gpu and one
>>>> only sees stale images.
>>>>
>>>> The attached patches for nouveau and radeon kms seem to work
>>>> pretty ok, page flipping works, display updates, tear-free,
>>>> dmabuf fence sync works, onset timing/timestamping is correct.
>>>> They simply pin the buffer back into GTT, then unpin, to force
>>>> a move of the buffer into the GTT domain, and thereby force the
>>>> following pin to do a new copy from GTT -> VRAM. The code tries
>>>> to avoid a useless copy from VRAM -> GTT during the pin op.
>>>>
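
The stale-frame problem and the GTT bounce can be sketched with a toy
model in Python. This is not the radeon/nouveau/ttm code and none of the
names below (ImportedBuffer, flip, run) exist in the kernel; they are
hypothetical and only mirror the behaviour described above: pinning into
VRAM copies from system RAM only if the buffer is not already in the
VRAM domain, and unpin leaves the domain unchanged. The GTT pin in the
model never copies anything, which corresponds to the case where the
useless VRAM -> GTT copy is avoided.

```python
GTT, VRAM = "GTT", "VRAM"


class ImportedBuffer:
    """Toy stand-in for a dmabuf imported by the display gpu."""

    def __init__(self):
        self.system_ram = None  # last frame written by the offload gpu
        self.vram_copy = None   # what the display engine would scan out
        self.domain = GTT       # where the driver believes the buffer lives

    def render(self, frame):
        # The render offload gpu only updates the system RAM side.
        self.system_ram = frame

    def pin(self, domain):
        # A GTT -> VRAM copy happens only on a real domain change.
        if domain == VRAM and self.domain != VRAM:
            self.vram_copy = self.system_ram
        self.domain = domain

    def unpin(self):
        # Unpinning does not move the buffer out of its current domain.
        pass


def flip(buf, force_refresh):
    if force_refresh:
        # Workaround: pin/unpin to GTT so the next VRAM pin must copy.
        buf.pin(GTT)
        buf.unpin()
    buf.pin(VRAM)
    shown = buf.vram_copy
    buf.unpin()
    return shown


def run(force_refresh, frames="ABCD", num_buffers=2):
    buffers = [ImportedBuffer() for _ in range(num_buffers)]
    shown = []
    for i, frame in enumerate(frames):
        buf = buffers[i % num_buffers]  # alternate between the buffers
        buf.render(frame)
        shown.append(flip(buf, force_refresh))
    return shown


print(run(force_refresh=False))  # ['A', 'B', 'A', 'B']
print(run(force_refresh=True))   # ['A', 'B', 'C', 'D']
```

Without the forced refresh the model shows the same symptom as above:
the display alternates between the first frames that were ever copied
into VRAM.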
>>>> However, the approach feels very much like a hack, so i assume
>>>> this is not the proper way of doing it? I looked at what ttm has
>>>> to offer, but couldn't find anything elegant and obvious. Maybe
>>>> there is a way to evict a bo without actually copying data back
>>>> to RAM? Or to invalidate the VRAM copy as stale? Maybe i just
>>>> missed something, as i'm not very familiar with ttm.
>>>>
>>>> Thoughts or suggestions?
>>>>
>>>> Another insight with my hacks is so far that nouveau seems to
>>>> be fast as prime exporter/renderoffload, but rather slow as
>>>> display gpu/prime importer, as tested on a 2008 or 2009
>>>> MacBookPro dual-Nvidia laptop.
>>>>
>>>> AMD, as tested with dual Radeon HD-5770 seems to be fast as prime
>>>> importer/display gpu, but very slow as prime exporter/render offload,
>>>> e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. Seems
>>>> that Mesa's blitImage function is the slow bit here. On r600 it seems
>>>> to draw a textured triangle strip to detile the gpu renderbuffer and
>>>> copy it into GTT. As drawing a textured fullscreen quad is normally
>>>> much faster, something special seems to be going on there wrt. DMA?
>>>> However, i don't have a realistic Enduro test setup with AMD
>>>> iGPU + dGPU, only this cobbled together dual HD-5770's in a MacPro,
>>>> so this could be wrong.
>>>>
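
For a sense of scale, the 16 msecs figure can be turned into an
effective copy rate. The following is plain arithmetic in Python; the
4 bytes per pixel and the 60 Hz refresh rate are assumptions for
illustration, not values stated above, and copy_cost() is just a
made-up helper name.

```python
def copy_cost(width, height, bytes_per_pixel, copy_ms, refresh_hz):
    """Return (frame size in bytes, throughput in MB/s, share of a refresh)."""
    frame_bytes = width * height * bytes_per_pixel
    throughput_mb_s = frame_bytes / (copy_ms / 1000.0) / 1e6
    refresh_ms = 1000.0 / refresh_hz
    return frame_bytes, throughput_mb_s, copy_ms / refresh_ms


# Assumed: 32 bpp framebuffer, 60 Hz display.
frame_bytes, mb_s, share = copy_cost(1920, 1080, 4, 16.0, 60.0)
print(frame_bytes)      # 8294400
print(round(mb_s, 1))   # 518.4
print(round(share, 2))  # 0.96
```

Under these assumptions the detiling copy moves about 8.3 MB at roughly
518 MB/s and eats about 96% of one refresh interval.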
>>>> thanks,
>>>> -mario
>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
>>>
>>>