"Fixes" for page flipping under PRIME on AMD & nouveau
Mario Kleiner
mario.kleiner.de at gmail.com
Fri Aug 26 19:57:19 UTC 2016
To pick this up again after a week of manic testing :)
On 08/18/2016 04:23 AM, Michel Dänzer wrote:
> On 18/08/16 01:12 AM, Mario Kleiner wrote:
>>
>> Intel as display gpu + nouveau for render offload worked nicely
>> on intel-ddx with page flipping, proper timing, dmabuf fence sync
>> and all.
>
> How about with AMD instead of nouveau in this case?
>
I don't have any real AMD Enduro laptop with either Intel + AMD or AMD +
AMD at the moment, so I tested with my hacked-up setups, but there things
look very good:
a) A standard PC with an Intel Haswell iGPU + AMD Tonga Pro R9 380. This
seems to work correctly: page flipping is used, there are no visual
artifacts or other problems, and my measurement equipment also shows
perfect timing and no glitches. Performance is very good, even without
Marek's recent SDMA + PRIME patch series. It seems, though, that with his
patches applied, some of the many criteria for using that path don't get
satisfied on my machine, so it uses a fallback path.
One thing that confuses me so far is that the visual results and
measurements suggest it works nicely, properly serializing the
rendering/detiling blit and the pageflip. But when I ftrace the Intel
driver's reservation_object_wait_timeout_rcu() call, where it normally
waits for the dmabuf fence to complete, I never see it blocking for more
than a few dozen microseconds, and I couldn't find any other place where
it blocks on detiling blit completion yet. In other words, it seems to
work correctly in practice, but I don't know where it actually blocks. It
could also be that the flip work function in Intel's driver just executes
after the detiling blit has already completed. (See the sketch after
point b) below for the kind of wait I'm tracing.)
b) A MacPro with dual Radeon HD-5770 and an NVidia GeForce, with my
pageflip hacks applied. I ported Marek's Mesa SDMA patch to r600, and
with that I get very good performance with AMD Evergreen as the render
offload GPU, both for the NVidia + AMD and the AMD + AMD combo. So this
solved the performance problems on the older GPUs. I assume Intel + old
radeon-kms would behave equally well. So thanks Marek, that was perfect!
I guess that means we are really good now wrt. render offload whenever an
Intel iGPU is used for display, regardless of whether nouveau or AMD is
used as the dGPU :)
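Coming back to the fence wait mentioned under a): here is a minimal
sketch of the kind of wait I'm tracing, assuming a kernel of roughly this
vintage where reservation_object_wait_timeout_rcu() is the cross-driver
fence wait. The helper name wait_for_prime_fences() is purely
illustrative, not an actual i915 function:

#include <linux/reservation.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

/*
 * Illustrative only, not actual i915 code: block until all fences the
 * exporting GPU attached to the shared BO's reservation object have
 * signaled, i.e. until the detiling blit into the dmabuf has finished,
 * before the display side is allowed to flip to the buffer.
 */
static int wait_for_prime_fences(struct reservation_object *resv)
{
	long ret;

	ret = reservation_object_wait_timeout_rcu(resv,
						   true,  /* wait for all fences */
						   false, /* not interruptible */
						   msecs_to_jiffies(100));
	if (ret == 0)
		return -ETIMEDOUT;   /* fences didn't signal in time */
	if (ret < 0)
		return ret;          /* wait aborted with an error */

	return 0;                    /* blit finished, safe to scan out */
}

If the flip path blocks in a wait like this, the ftrace duration of the
call should directly show how long scanout waited on the render GPU,
which is why the few dozen microseconds I see confuse me.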
>
>> Turns out that prime + page flipping currently doesn't work
>> on nouveau and amd. The first offload rendered images from
>> the imported dmabufs show up properly, but then the display
>> is stuck alternating between the first two or three rendered
>> frames.
>>
>> The problem is that during the pageflip ioctl we pin the
>> dmabuf into VRAM in preparation for scanout, then unpin it
>> when we are done with it at next flip, but the buffer stays
>> in the VRAM memory domain.
>
> Sounds like you found a bug here: BOs which are being shared between
> different GPUs should always be pinned to GTT, moving them to VRAM (and
> consequently the page flip) should fail.
>
Seems so, although I hoped I was fixing a bug, not exploiting a loophole.
In practice I haven't observed trouble with the hack so far. I haven't
looked deeply enough into how the DMA API below dmabuf operates, so this
is just guesswork, but I suspect the reason this doesn't blow up in an
obvious way is that if the render offload GPU exports the dmabuf, then
the pages get pinned/locked into system RAM, so they can't move around or
get paged out to swap as long as the dmabuf stays exported. When the
dmabuf-importing AMD or nouveau display GPU then moves the BO from GTT to
VRAM (or pseudo-moves it back with my hack), all that changes is some pin
refcount for the RAM pages, but the refcount always stays non-zero and
the system RAM isn't freed or moved around during the session. I just
wonder if this bug couldn't somehow be turned into a proper feature?
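To illustrate the exporter-side pinning I'm guessing at above, here is a
sketch modeled loosely on radeon's PRIME pin hook (paraphrased from
memory, so the exact names and details may differ from the real driver):
when another device attaches to the exported dmabuf, the BO gets pinned
into GTT, which keeps its backing pages resident in system RAM for as
long as the attachment exists.

#include "radeon.h"  /* driver-internal header, for the radeon_bo helpers */

/*
 * Sketch of an exporter-side PRIME pin hook, loosely modeled on radeon's.
 * Pinning to GTT is what keeps the BO's system RAM pages from moving or
 * being swapped out while the dmabuf is shared with the display GPU.
 */
static int example_gem_prime_pin(struct drm_gem_object *obj)
{
	struct radeon_bo *bo = gem_to_radeon_bo(obj);
	int ret;

	ret = radeon_bo_reserve(bo, false);
	if (unlikely(ret != 0))
		return ret;

	/* Pages stay locked in system RAM until the matching unpin. */
	ret = radeon_bo_pin(bo, RADEON_GEM_DOMAIN_GTT, NULL);
	radeon_bo_unreserve(bo);

	return ret;
}

If that picture is accurate, the exporter-side pin is held for the whole
lifetime of the export, so the importer's domain games on its side
shouldn't be able to invalidate the pages, which would fit what I'm
seeing in practice.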
I'm tempted to keep my patches as a temporary stopgap measure in some
kernel on GitHub, so my users could use them to get NVidia + NVidia or at
least old AMD + AMD setups with radeon-kms + ati-ddx working well enough
for their research work until some proper solution comes around. But if
you think there is some major way this could blow up, corrupt data, or
hang/crash during normal use, then better not. I don't know how many of
my users have such systems, as my advice to them so far was to "stay the
hell away from anything with hybrid graphics/Optimus/Enduro in its name
if you value your work". Now I could change my purchase advice to
"anything hybrid with an Intel iGPU is probably ok in terms of
correctness/timing/performance for not-too-demanding needs".
> The latest versions of DCE support scanning out from GTT, so that might
> be a good solution at least for Carrizo and newer APUs, not sure it
> makes sense for dGPUs though.
That would be good to have. But does that mean DCE-11 or later only? What
is the constraint on older parts? Do they need contiguous memory? I
personally don't care about the dGPU case; I only use these dGPUs for
testing because I don't have access to any real Enduro laptops with APUs.
-mario
>
>
>> AMD, as tested with dual Radeon HD-5770 seems to be fast as prime
>> importer/display gpu, but very slow as prime exporter/render offload,
>> e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. Seems
>> that Mesa's blitImage function is the slow bit here. On r600 it seems
>> to draw a textured triangle strip to detile the gpu renderbuffer and
>> copy it into GTT. As drawing a textured fullscreen quad is normally
>> much faster, something special seems to be going on there wrt. DMA?
>
> Maybe the rasterization as two triangles results in bad PCIe bandwidth
> utilization. Using the asynchronous DMA engine for these transfers would
> probably be ideal, but having the 3D engine rasterize a single rectangle
> (either using the rectangle primitive or a large triangle with scissor)
> might already help.
>
>