[RFC] DRI2 synchronization and swap bits

Mario Kleiner mario.kleiner at tuebingen.mpg.de
Sat Nov 7 23:16:51 PST 2009

On Nov 2, 2009, at 5:35 PM, Jesse Barnes wrote:
> Thanks a lot for taking time to go through this stuff, it's exactly  
> the
> kind of feedback I was hoping for.

Hello again

I'm relieved that i didn't screw up and annoy you already with my  
first post, so i'll continue to test my boundaries ;-)

> Doing the wakeups within a millisecond should definitely be possible,
> I don't expect the context switch between display server and client
> would be *that* high of a cost (but as I said I'll benchmark).

I don't expect that either. My comment was just to reinforce that  
very low latency matters for at least our class of applications. I'm  
currently benchmarking our toolkit wrt. timing precision and latency  
on a few different machine/gpu/os combinations. In case such numbers  
from other implementations (Linux with the proprietary drivers, OS/X,  
Windows) are interesting to you, let me know.

>> I don't like this idea about entirely fake numbers and like to vote
>> for a solution that is as close as possible to the non-redirected
>> case.

> The raw numbers will always be exposed to the compositor and probably
> to applications via an opt-out mechanism (to be defined still, we  
> don't
> even have the extra compositor protocol defined).

Happy to hear that.

>> Unreliable UST timestamps would make the whole OML_sync_control
>> extension almost useless for us and probably other applications that
>> require good sync e.g, btw. video and audio streams, so i'd ask you
>> politely for improvements here.
> Definitely; these are just bugs, I certainly didn't design it to  
> behave
> this way! :)

Assumed that :). Currently 1.5% of our users are on Linux and i'd  
love to persuade a few more to adopt Linux in the next year. I just  
realized that helping to improve the Linux graphics stack in areas  
that matter to us makes more sense than what i've been doing on all
operating systems for years - trying to cope with limitations and
driver bugs by use of weird hacks in our userspace application, the
best i can do on OS/X and Windows.

>> I guess one (simple from the viewpoint of  a non-kernel hacker?) way
>> would be to always timestamp the vblank in the drm_handle_vblank()
>> routine, immediately after incrementing the vblank_count, probably
>> protecting both the timestamp acquisition and vblank increment by
>> one spinlock, so both get updated atomically? Then one could maybe
>> extend  drm_vblank_count() to readout and return vblank count and
>> corresponding timestamp simultaneously under protection of the lock?
>> Or any other way to provide the timestamp together with the vblank
>> count in an atomic fashion to the calling code in
>> drm_queue_vblank_event() and
>> drm_handle_vblank_events()?
> Yep, that would work and should be a fairly easy change.

I spent a bit more time thinking about this, i also read about the  
available synchronization primitives and started to code the  
following possible implementation. Again apologies if i'm stating the  
totally obvious, or stuff that's been done or planned already.

My proposal to use a spinlock was probably rather stupid. Because of
glXGetSyncValuesOML() -> I830DRI2GetMSC -> drmWaitVBlank ->  
drm_wait_vblank -> drm_vblank_count(), if multiple clients call  
glXGetSyncValuesOML() frequently, e.g., in a polling loop, i assume  
this could cause quite a bit of contention on a spinlock that must be  
acquired with minimal delay from the vblank irq handler. According to  
<http://rt.wiki.kernel.org/index.php/RT_PREEMPT_HOWTO>, critical  
sections protected by spinlock_t are preemptible if one uses a  
realtime kernel with preempt-rt patches applied, something i'd expect  
my users to do frequently. Maybe i'm overlooking something, but this
sounds unhealthy if such preemption happens while drm_vblank_count()
holds the lock?

A lockless method would be the better solution. Seqlocks seem to be a  
good fit? There's only one writer per crtc, drm_handle_vblank() (and  
occasionally drm_update_vblank_count() if vblank irqs get reenabled)
which only writes infrequently (60 - 200 times per second on typical  
displays). There can be many readers which can read very frequently,  
and the datastructure to read is relatively simple and free of  
pointers, so this fits the model of seqlocks. Maybe one could even do  
with the versions that don't disable irqs, i.e., write_seqlock()  
instead of write_seqlock_irqsave()? Documentation says that one  
should use the irq-safe versions if the seqlock might be accessed  
from an interrupt handler. I looked at the implementation of seqlocks  
and as far as i can see, deadlock can only happen if a writer gets  
preempted by an irq handler that then tries to either write or read  
the seqlock itself? But a _vblank_seqlock would only get accessed for  
write access from the interrupt handler for a given crtc. The only  
other place of write access, drm_update_vblank_count(), gets called  
infrequently and within a spin_lock_irqsave(&dev->vbl_lock,  
irqflags); protected section where interrupts are disabled anyway.
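
To make the idea concrete, here is a minimal userspace sketch of that
lockless read path. In the kernel one would of course use seqlock_t
with write_seqlock()/read_seqbegin()/read_seqretry() rather than raw
atomics, and all names here are invented for illustration:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-crtc state, illustrating the seqlock pattern:
 * one infrequent writer (the vblank irq handler), many lock-free
 * readers (e.g. a drm_vblank_count() that also returns a timestamp). */
struct vblank_state {
    atomic_uint seq;        /* even = stable, odd = write in progress */
    uint32_t vblank_count;
    uint64_t vblank_time_us;
};

/* Writer: called once per vblank, e.g. from drm_handle_vblank(). */
static void vblank_update(struct vblank_state *s,
                          uint32_t count, uint64_t t_us)
{
    atomic_fetch_add_explicit(&s->seq, 1, memory_order_release); /* odd */
    s->vblank_count = count;
    s->vblank_time_us = t_us;
    atomic_fetch_add_explicit(&s->seq, 1, memory_order_release); /* even */
}

/* Reader: no lock taken; retries if a write raced with the read,
 * so count and timestamp always come out as a consistent pair. */
static void vblank_read(struct vblank_state *s,
                        uint32_t *count, uint64_t *t_us)
{
    unsigned int start;
    do {
        do {
            start = atomic_load_explicit(&s->seq, memory_order_acquire);
        } while (start & 1);           /* writer active, wait */
        *count = s->vblank_count;
        *t_us  = s->vblank_time_us;
    } while (atomic_load_explicit(&s->seq, memory_order_acquire) != start);
}
```

The point of the retry loop is that a polling glXGetSyncValuesOML()
client can never block the irq handler, only re-read.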

Btw., when looking at the code in drm_irq.c in the current linux-next  
tree, i saw that drm_handle_vblank_events() does a e->event.sequence  
= seq; assignment with the current seq vblank number when retiring an  
event, but the special shortcut path in drm_queue_vblank_event(),  
which retires events immediately without queuing them if the  
requested vblank number has been reached or exceeded already, does  
not do an update e->event.sequence = seq; with the most recent seq  
vblank number that triggered this early retirement. This looks  
inconsistent to me, could this be a bug?

The simple seqlock implementation might be too simple though and a  
ringbuffer that holds multiple hundred recent vblank timestamp  
samples might be better.

The problem is the accuracy of glXGetMscRateOML(). This value -  
basically the duration of a video refresh interval - gets calculated  
from the current video mode timing, i.e., dotclock, HTotal and  
VTotal. This value is only useful for userspace applications like my  
toolkit under the assumption that both the dotclock of the GPU and  
the current system clock (TSC / HPET / APIC timer / ...) are  
perfectly accurate and drift-free. In reality, both clocks are  
imperfect and drift against each other, therefore the returned  
nominal value of glXGetMscRateOML() is basically always a bit
wrong/inaccurate wrt. system time as used by userspace applications.
Our
app therefore determines the "real" refresh duration by a calibration  
loop of multiple seconds duration at startup. This works ok, but it
increases startup time, cannot account for slow clock drift over the
course of a session (i can't recalibrate mid-session), and the
calibration itself is imperfect due to the timing noise (preemption,
scheduling jitter, wakeup latency after swapbuffers, etc.) that
affects any measurement loop in userspace.

A better approach would be for Linux to measure the current video  
refresh interval over a certain time window, e.g., computing a moving  
average over a few seconds. This could be done if the vblank  
timestamps are logged into a ringbuffer. The ringbuffer would allow  
for lock-free readout of the most recent vblank timestamp from  
drm_vblank_count(). At the same time the system could look at all  
samples in the ringbuffer to compute the real duration of a video  
refresh interval as a average over the deltas between samples in the  
ringbuffer and provide an accurate and current estimate of  
glXGetMscRateOML() that would be better than anything we can do in
userspace.

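A sketch of such a ringbuffer (a userspace illustration only - the
struct, names and the fixed capacity are all made up, not existing
DRM code):

```c
#include <stdint.h>

#define VBL_RING_SIZE 128   /* hypothetical capacity, power of two */

/* Per-crtc ring of recent vblank timestamps. The most recent entry
 * serves the fast timestamp readout; the whole ring serves the
 * moving-average refresh duration for glXGetMscRateOML(). */
struct vblank_ring {
    uint64_t t_us[VBL_RING_SIZE];  /* timestamps in microseconds */
    unsigned int head;             /* next slot to write */
    unsigned int count;            /* number of valid samples */
};

/* Writer: log one timestamp per vblank irq. */
static void vblank_ring_log(struct vblank_ring *r, uint64_t t_us)
{
    r->t_us[r->head] = t_us;
    r->head = (r->head + 1) % VBL_RING_SIZE;
    if (r->count < VBL_RING_SIZE)
        r->count++;
}

/* Measured refresh duration in microseconds: total elapsed time
 * between oldest and newest sample, divided by the number of
 * intervals. Returns 0 if fewer than 2 samples exist yet. */
static uint64_t vblank_ring_refresh_us(const struct vblank_ring *r)
{
    unsigned int newest, oldest;

    if (r->count < 2)
        return 0;
    newest = (r->head + VBL_RING_SIZE - 1) % VBL_RING_SIZE;
    oldest = (r->head + VBL_RING_SIZE - r->count) % VBL_RING_SIZE;
    return (r->t_us[newest] - r->t_us[oldest]) / (r->count - 1);
}
```

Averaging over the full window rather than per-pair deltas keeps the
estimate cheap and naturally smooths out irq dispatch jitter on
individual samples.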
The second problem is how to reinitialize the current vblank  
timestamp in drm_update_vblank_count() when vblank interrupts get  
reenabled after they've been disabled for a long period of time?

One generic way to reinitialize would be to calculate elapsed time  
since last known vblank timestamp from the computed vblank count  
"diff" by multiplying the count with the known duration of the video  
refresh interval. In that case, an accurate estimate of  
glXGetMscRateOML would be important, so a ringbuffer with samples  
would probably help.
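
In code the reinitialization would be a one-liner (function and
parameter names are hypothetical, "diff" being the counter delta that
drm_update_vblank_count() already computes):

```c
#include <stdint.h>

/* Re-seed the last-vblank timestamp after vblank irqs were disabled
 * for "diff" refresh intervals, using the measured refresh duration. */
static uint64_t vblank_reseed_time_us(uint64_t last_vblank_us,
                                      uint32_t diff,
                                      uint64_t refresh_us)
{
    return last_vblank_us + (uint64_t)diff * refresh_us;
}
```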

Here is another proposal that i would love to see in Linux:

A good method to find the accurate time of the last vblank  
immediately, irrespective of possible delays in irq dispatching, is  
to use the current scanout position of the crtc as a high resolution  
clock that counts time since start of the vertical blank interval.  
Afaik basically all GPUs have a register that allows reading out the
currently scanned out scanline. If the exact duration of a video
refresh interval 'refreshduration' is known by measurement, the  
current 'scanline' is known by a register read, the total height of  
the display vtotal is known, and one has a timestamp of current  
system time tsystem from do_gettimeofday(), one can conceptually  
implement this pseudo-code:

scanline = dev->driver->getscanline(dev, crtc);
tsystem = do_gettimeofday(...);
tvblank = tsystem - refreshduration * (scanline / vtotal);

(One has to measure scanline relative to the first line inside the  
vblank area though, and the math might be a bit more complex, due to  
the lack of floating point arithmetic in the kernel?).

This would make it possible to quickly reinitialize the vblank
timestamp in drm_update_vblank_count() and to provide vblank
timestamps from
inside drm_handle_vblank() which are robust against interrupt  
dispatch latency. It would require a new callback function into the  
GPU specific driver, e.g., dev->driver->getscanline(dev, crtc). One  
could have a simple do_gettimeofday(&tvblank) as fallback for drivers  
that don't implement the new callback. Maybe one could even implement
a dev->driver->getvblanktime(dev, crtc); callback that executes the
above lines inside one function and optionally allows use of more
clever GPU specific strategies like GPU internal clocks and snapshot
registers.

Our app uses this "beamposition timestamping" in userspace to get rid  
of most of the scheduling/wakeup delay and jitter after swap  
completion on Windows and MacOS/X. Both systems provide a function to  
query the current scanout position. It is so far the only method that  
allows us to achieve sub-millisecond precision for our timestamps. It  
also allows us to provide a timestamp for vbl onset and a timestamp  
for start of scanout of a new frame. If Linux could do this within  
its timestamping code by default, it would be even more accurate. Our  
userspace code can get preempted at the wrong moment, but kernel code  
inside the irq handler should be more robust. Also it would allow a  
more precise implementation of the UST timestamp according to the  
OML_sync_control spec, as that spec asks for UST being the time of  
start of scanout of the first scanline of a new video frame, instead  
of start of vblank.

Let me know what you think about this,

Mario Kleiner
Max Planck Institute for Biological Cybernetics
Spemannstr. 38
72076 Tuebingen

e-mail: mario.kleiner at tuebingen.mpg.de
office: +49 (0)7071/601-1623
fax:    +49 (0)7071/601-616
www:    http://www.kyb.tuebingen.mpg.de/~kleinerm
"For a successful technology, reality must take precedence
over public relations, for Nature cannot be fooled."
(Richard Feynman)
