"Fixes" for page flipping under PRIME on AMD & nouveau
Mario Kleiner
mario.kleiner.de at gmail.com
Fri Aug 26 19:57:19 UTC 2016
To pick this up again after a week of manic testing :)
On 08/18/2016 04:23 AM, Michel Dänzer wrote:
> On 18/08/16 01:12 AM, Mario Kleiner wrote:
>>
>> Intel as display gpu + nouveau for render offload worked nicely
>> on intel-ddx with page flipping, proper timing, dmabuf fence sync
>> and all.
>
> How about with AMD instead of nouveau in this case?
>
I don't have any real AMD Enduro laptop with either Intel + AMD or AMD +
AMD at the moment, so I tested with my hacked-up setups, but there things
look very good:
a) A standard PC with an Intel Haswell iGPU + AMD Tonga Pro R9 380. This
seems to work correctly: page flipping is used, there are no visual
artifacts or other problems, and my measurement equipment also shows
perfect timing and no glitches. Performance is very good, even without
Marek's recent SDMA + PRIME patch series. It seems, though, that with his
patches applied, some of the many criteria for using that path don't get
satisfied on my machine, so it uses a fallback path.
One thing that confuses me so far is that the visual results and
measurements suggest it works nicely, properly serializing the
rendering/detiling blit and the pageflip. But when I ftrace the Intel
driver's reservation_object_wait_timeout_rcu() call, where it normally
waits for the dmabuf fence to complete, I never see it blocking for more
than a few dozen microseconds, and I couldn't find any other place where
it blocks on detiling blit completion yet. In other words, it seems to
work correctly in practice, but I don't know where it actually blocks. It
could also be that the flip work function in Intel's driver just executes
after the detiling blit has already completed. (See the sketch after
point b) below for the kind of wait I'm tracing.)
b) A MacPro with dual Radeon HD-5770 and an NVidia GeForce, with my
pageflip hacks applied. I ported Marek's Mesa SDMA patch to r600, and
with that I get very good performance with AMD Evergreen as the render
offload GPU, both for the NVidia + AMD and the AMD + AMD combo. So this
solved the performance problems on the older GPUs. I assume Intel + old
radeon-kms would behave equally well. So thanks Marek, that was perfect!
I guess that means we are really good now wrt. render offload whenever an
Intel iGPU is used for display, regardless of whether nouveau or AMD is
used as the dGPU :)
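Coming back to the fence wait mentioned under a): here is a minimal
sketch of the kind of wait I'm tracing, assuming a kernel of roughly this
vintage where reservation_object_wait_timeout_rcu() is the cross-driver
fence wait. The helper name wait_for_prime_fences() is purely
illustrative, not an actual i915 function:

#include <linux/reservation.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

/*
 * Illustrative only, not actual i915 code: block until all fences the
 * exporting GPU attached to the shared BO's reservation object have
 * signaled, i.e. until the detiling blit into the dmabuf has finished,
 * before the display side is allowed to flip to the buffer.
 */
static int wait_for_prime_fences(struct reservation_object *resv)
{
	long ret;

	ret = reservation_object_wait_timeout_rcu(resv,
						   true,  /* wait for all fences */
						   false, /* not interruptible */
						   msecs_to_jiffies(100));
	if (ret == 0)
		return -ETIMEDOUT;   /* fences didn't signal in time */
	if (ret < 0)
		return ret;          /* wait aborted with an error */

	return 0;                    /* blit finished, safe to scan out */
}

If the flip path blocks in a wait like this, the ftrace duration of the
call should directly show how long scanout waited on the render GPU,
which is why the few dozen microseconds I see confuse me.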
>
>> Turns out that prime + page flipping currently doesn't work
>> on nouveau and amd. The first offload rendered images from
>> the imported dmabufs show up properly, but then the display
>> is stuck alternating between the first two or three rendered
>> frames.
>>
>> The problem is that during the pageflip ioctl we pin the
>> dmabuf into VRAM in preparation for scanout, then unpin it
>> when we are done with it at next flip, but the buffer stays
>> in the VRAM memory domain.
>
> Sounds like you found a bug here: BOs which are being shared between
> different GPUs should always be pinned to GTT, moving them to VRAM (and
> consequently the page flip) should fail.
>
Seems so, although I hoped I was fixing a bug, not exploiting a loophole.
In practice I haven't observed trouble with the hack so far. I haven't
looked deeply enough into how the DMA API below dmabuf operates, so this
is just guesswork, but I suspect the reason this doesn't blow up in an
obvious way is that if the render offload GPU exports the dmabuf, then
the pages get pinned/locked into system RAM, so they can't move around or
get paged out to swap as long as the dmabuf stays exported. When the
dmabuf-importing AMD or nouveau display GPU then moves the BO from GTT to
VRAM (or pseudo-moves it back with my hack), all that changes is some pin
refcount for the RAM pages, but the refcount always stays non-zero and
the system RAM isn't freed or moved around during the session. I just
wonder if this bug couldn't somehow be turned into a proper feature?
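To illustrate the exporter-side pinning I'm guessing at above, here is a
sketch modeled loosely on radeon's PRIME pin hook (paraphrased from
memory, so the exact names and details may differ from the real driver):
when another device attaches to the exported dmabuf, the BO gets pinned
into GTT, which keeps its backing pages resident in system RAM for as
long as the attachment exists.

#include "radeon.h"  /* driver-internal header, for the radeon_bo helpers */

/*
 * Sketch of an exporter-side PRIME pin hook, loosely modeled on radeon's.
 * Pinning to GTT is what keeps the BO's system RAM pages from moving or
 * being swapped out while the dmabuf is shared with the display GPU.
 */
static int example_gem_prime_pin(struct drm_gem_object *obj)
{
	struct radeon_bo *bo = gem_to_radeon_bo(obj);
	int ret;

	ret = radeon_bo_reserve(bo, false);
	if (unlikely(ret != 0))
		return ret;

	/* Pages stay locked in system RAM until the matching unpin. */
	ret = radeon_bo_pin(bo, RADEON_GEM_DOMAIN_GTT, NULL);
	radeon_bo_unreserve(bo);

	return ret;
}

If that picture is accurate, the exporter-side pin is held for the whole
lifetime of the export, so the importer's domain games on its side
shouldn't be able to invalidate the pages, which would fit what I'm
seeing in practice.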
I'm tempted to keep my patches as a temporary stopgap measure in some
kernel on GitHub, so my users could use them to get NVidia + NVidia or at
least old AMD + AMD setups with radeon-kms + ati-ddx working well enough
for their research work until some proper solution comes around. But if
you think there is some major way this could blow up, corrupt data, or
hang/crash during normal use, then better not. I don't know how many of
my users have such systems, as my advice to them so far was to "stay the
hell away from anything with hybrid graphics/Optimus/Enduro in its name
if you value your work". Now I could change my purchase advice to
"anything hybrid with an Intel iGPU is probably ok in terms of
correctness/timing/performance for not-too-demanding needs".
> The latest versions of DCE support scanning out from GTT, so that might
> be a good solution at least for Carrizo and newer APUs, not sure it
> makes sense for dGPUs though.
That would be good to have. But does that mean DCE-11 or later only? What
is the constraint on older parts? Do they need contiguous memory? I
personally don't care about the dGPU case; I only use these dGPUs for
testing because I don't have access to any real Enduro laptops with APUs.
-mario
>
>
>> AMD, as tested with dual Radeon HD-5770 seems to be fast as prime
>> importer/display gpu, but very slow as prime exporter/render offload,
>> e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. Seems
>> that Mesa's blitImage function is the slow bit here. On r600 it seems
>> to draw a textured triangle strip to detile the gpu renderbuffer and
>> copy it into GTT. As drawing a textured fullscreen quad is normally
>> much faster, something special seems to be going on there wrt. DMA?
>
> Maybe the rasterization as two triangles results in bad PCIe bandwidth
> utilization. Using the asynchronous DMA engine for these transfers would
> probably be ideal, but having the 3D engine rasterize a single rectangle
> (either using the rectangle primitive or a large triangle with scissor)
> might already help.
>
>