"Fixes" for page flipping under PRIME on AMD & nouveau

Fri Aug 26 20:07:17 UTC 2016

On 08/18/2016 04:32 AM, Michel Dänzer wrote:
> On 18/08/16 08:51 AM, Mario Kleiner wrote:
>>
>> That's what the ati-ddx/amdgpu-ddx does at the moment, as it detects the
>> mismatch in tiling flags and uses the DRI3/Present copy path instead of
>> the pageflip path. The problem is that the server's Present
>> implementation doesn't request a vsync'ed start of the copy operation [...]
>
> It waits for vblank before starting the copy.
>

Yes, a vblank event triggers the present_execute in the server. But the
total latency from vblank event dispatch to the copy command packet
hitting the gpu is still far too high to avoid tearing. I tried again and
couldn't find a single intel/amd/nvidia gpu here that doesn't tear more
or less badly, depending on load, with DRI3/Present copy swaps. Even
tearfree wouldn't be good enough for my kind of applications, as crucial
timing/timestamps could still frequently be off by at least 1 frame.
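
To put rough numbers on that latency budget, here is a minimal sketch. It
assumes the standard CEA 1080p60 timing (htotal 2200, vtotal 1125, 148.5 MHz
pixel clock), assumes the vblank event is dispatched at the start of vertical
blanking, and uses a deliberately conservative criterion: the copy must both
start and finish inside the blanking interval. The function names and the
example latencies are made up for illustration, not measurements.

```python
def vblank_duration_us(htotal, vtotal, vactive, pixel_clock_hz):
    """Length of the vertical blanking interval in microseconds."""
    line_time_us = htotal / pixel_clock_hz * 1e6   # time to scan one line
    return (vtotal - vactive) * line_time_us       # blanking lines * line time


def copy_fits_in_vblank(latency_us, copy_us, vblank_us):
    """Conservative check: dispatch latency plus copy time must not exceed
    the blanking interval, otherwise the copy overlaps active scanout."""
    return latency_us + copy_us <= vblank_us


# Assumed mode: CEA 1920x1080 @ 60 Hz.
VBLANK_US = vblank_duration_us(2200, 1125, 1080, 148_500_000)

if __name__ == "__main__":
    print(f"vblank lasts about {VBLANK_US:.0f} us of a 16667 us frame")
    # Hypothetical latencies, for illustration only.
    for latency_us in (200, 500, 2000):
        ok = copy_fits_in_vblank(latency_us, copy_us=300, vblank_us=VBLANK_US)
        print(f"latency {latency_us:5d} us -> {'fits' if ok else 'may tear'}")
```

With this mode the window is well under a millisecond, so scheduling delay
in the server or command submission under load can easily use it up. In
practice a copy that stays ahead of the scanout position may still not tear,
so treat this as an upper bound on how strict the requirement is.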

>
>> There is this other approach from NVidia's Alex Goins for their
>> proprietary driver, whose patches landed in the X-Server 1.19 master
>> branch a couple of weeks ago. I haven't read his patches in detail yet,
>> and i so far couldn't successfully test them with the reference
>> implementation in modesetting ddx 1.19. Afaik there the display gpu
>> exports a pair of scanout friendly, page flipping compatible dmabufs (i
>> assume linear, contiguous, accessible by the display engines),
>
> FWIW, that wouldn't be possible with our "older" GPUs which can't scan
> out from GTT: A BO can be either shared with another GPU or scanout
> friendly, not both at the same time.
>

Ok, good to know.

>
>> and the offload gpu imports those and renders into them. That saves
>> one extra copy, so should be somewhat more efficient.
>
> Using two shared buffers actually isn't as efficient as possible wrt
> inter-GPU bandwidth.
>

Out of interest, why? You'd have only one detiling copy VRAM -> RAM? Or 
is it about switching some kind of GTT mappings with two buffers that is 
inefficient?

>
>> Setting it up seems to be more involved and less flexible though. So far
>> i couldn't make it work here for testing. Maybe bugs, maybe mistakes on
>> my side, maybe i just have the wrong hardware for it.
>
> Yeah, my impression has been it's a rather complicated solution geared
> towards the Intel iGPU + proprietary nVidia use case.
>
>

Setting up output source/output sink is not fun, as i have now learned. It
is rather clumsy and complex compared to render offload. I hope the real
thing will come with some fool-proof one-click setup GUI, otherwise i
don't have great hopes, given the technical skill level of my users. I
still haven't managed to get it working, not even with the new Nvidia
proprietary beta drivers on a real Optimus laptop.

-mario