[RFC] DRI2 synchronization and swap bits
Mario Kleiner
mario.kleiner at tuebingen.mpg.de
Sat Nov 7 23:16:51 PST 2009
On Nov 2, 2009, at 5:35 PM, Jesse Barnes wrote:
>
> Thanks a lot for taking time to go through this stuff, it's exactly
> the kind of feedback I was hoping for.
Hello again,
I'm relieved that I didn't screw up and annoy you already with my
first post, so I'll continue to test my boundaries ;-)
> Doing the wakeups within a millisecond should definitely be possible,
> I don't expect the context switch between display server and client
> would be *that* high of a cost (but as I said I'll benchmark).
I don't expect that either. My comment was just to reinforce that
very low latency matters for at least our class of applications. I'm
currently benchmarking our toolkit's timing precision and latency
on a few different machine/GPU/OS combinations. If such numbers
from other implementations (Linux with the proprietary drivers, OS/X,
Windows) would be of interest to you, let me know.
>> I don't like this idea about entirely fake numbers and like to vote
>> for a solution that is as close as possible to the non-redirected
>> case.
> The raw numbers will always be exposed to the compositor and probably
> to applications via an opt-out mechanism (to be defined still, we
> don't even have the extra compositor protocol defined).
Happy to hear that.
>> Unreliable UST timestamps would make the whole OML_sync_control
>> extension almost useless for us and probably for other applications
>> that require good sync, e.g., between video and audio streams, so I'd
>> ask you politely for improvements here.
>
> Definitely; these are just bugs, I certainly didn't design it to
> behave this way! :)
I assumed that :). Currently 1.5% of our users are on Linux, and I'd
love to persuade a few more to adopt it over the next year. I've
realized that helping to improve the Linux graphics stack in the areas
that matter to us makes more sense than what I've been doing for years
on every operating system: coping with limitations and driver bugs
through weird hacks in our userspace application, which is the best I
can do on OS/X and Windows.
>
>> I guess one (simple from the viewpoint of a non-kernel hacker?) way
>> would be to always timestamp the vblank in the drm_handle_vblank()
>> routine, immediately after incrementing the vblank_count, probably
>> protecting both the timestamp acquisition and vblank increment by
>> one spinlock, so both get updated atomically? Then one could maybe
>> extend drm_vblank_count() to readout and return vblank count and
>> corresponding timestamp simultaneously under protection of the lock?
>> Or any other way to provide the timestamp together with the vblank
>> count in an atomic fashion to the calling code in
>> drm_queue_vblank_event(), drm_queue_vblank_event() and
>> drm_handle_vblank_events()?
>
> Yep, that would work and should be a fairly easy change.
I spent a bit more time thinking about this, read up on the available
synchronization primitives, and started to code the following possible
implementation. Again, apologies if I'm stating the totally obvious,
or describing things that have already been done or planned.
My proposal to use a spinlock was probably rather stupid. Because of
glXGetSyncValuesOML() -> I830DRI2GetMSC -> drmWaitVBlank ->
drm_wait_vblank -> drm_vblank_count(), if multiple clients call
glXGetSyncValuesOML() frequently, e.g., in a polling loop, I assume
this could cause quite a bit of contention on a spinlock that must be
acquired with minimal delay from the vblank irq handler. According to
<http://rt.wiki.kernel.org/index.php/RT_PREEMPT_HOWTO>, critical
sections protected by spinlock_t are preemptible if one uses a
realtime kernel with the preempt-rt patches applied, something I'd
expect my users to do frequently. Maybe I'm overlooking something, but
that sounds unhealthy if it happens while drm_vblank_count() holds the
lock?
A lockless method would be the better solution. Seqlocks seem to be a
good fit: there's only one writer per crtc, drm_handle_vblank() (and
occasionally drm_update_vblank_count() if vblank irqs get reenabled),
which writes infrequently (60 - 200 times per second on typical
displays). There can be many readers which can read very frequently,
and the data structure to read is relatively simple and free of
pointers, so this fits the seqlock model. Maybe one could even get by
with the variants that don't disable irqs, i.e., write_seqlock()
instead of write_seqlock_irqsave()? The documentation says that one
should use the irq-safe variants if the seqlock might be accessed
from an interrupt handler. I looked at the implementation of seqlocks
and, as far as I can see, deadlock can only happen if a writer gets
preempted by an irq handler that then tries to either write or read
the seqlock itself. But a _vblank_seqlock would only get accessed for
write access from the interrupt handler for a given crtc. The only
other place of write access, drm_update_vblank_count(), gets called
infrequently and inside a spin_lock_irqsave(&dev->vbl_lock, irqflags);
protected section where interrupts are disabled anyway.
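To make this concrete, here is a rough sketch of what I have in mind.
The struct and its field/function names are mine, not the real DRM
code, and the seqlock_init() call at setup time is omitted:

#include <linux/seqlock.h>
#include <linux/time.h>

/* Hypothetical per-crtc state; not the actual DRM fields. */
struct vblank_state {
    seqlock_t lock;        /* seqlock_init() at setup time */
    u32 count;             /* vblank counter */
    struct timeval ts;     /* timestamp of most recent vblank */
};

/* Writer: called from drm_handle_vblank() for this crtc only. */
static void vblank_update(struct vblank_state *s)
{
    write_seqlock(&s->lock);    /* _irqsave variant if in doubt */
    s->count++;
    do_gettimeofday(&s->ts);
    write_sequnlock(&s->lock);
}

/* Lock-free readers: retry if a writer raced with us. */
static u32 vblank_read(struct vblank_state *s, struct timeval *ts)
{
    unsigned int seq;
    u32 count;

    do {
        seq = read_seqbegin(&s->lock);
        count = s->count;
        *ts = s->ts;
    } while (read_seqretry(&s->lock, seq));

    return count;
}

drm_vblank_count() could then hand out the count and the corresponding
timestamp together via something like vblank_read().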
Btw., when looking at the code in drm_irq.c in the current linux-next
tree, I saw that drm_handle_vblank_events() does an e->event.sequence
= seq; assignment with the current seq vblank number when retiring an
event, but the special shortcut path in drm_queue_vblank_event(),
which retires events immediately without queuing them if the
requested vblank number has already been reached or exceeded, does
not update e->event.sequence with the most recent seq vblank number
that triggered the early retirement. This looks inconsistent to me;
could this be a bug?
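If it is indeed a bug, the fix would presumably just mirror what
drm_handle_vblank_events() does. Roughly, and without claiming this
matches the exact linux-next code, the early-retirement branch would
gain something like:

/* In drm_queue_vblank_event(), in the branch that retires the event
   immediately because the requested vblank has already passed: */
e->event.sequence = seq;    /* report the vblank that actually retired us */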
The simple seqlock implementation might be too simple, though; a
ringbuffer that holds a few hundred recent vblank timestamp samples
might be better.
The problem is the accuracy of glXGetMscRateOML(). This value -
basically the duration of a video refresh interval - gets calculated
from the current video mode timing, i.e., dotclock, HTotal and
VTotal. That value is only useful for userspace applications like my
toolkit under the assumption that both the dotclock of the GPU and
the current system clock (TSC / HPET / APIC timer / ...) are
perfectly accurate and drift-free. In reality, both clocks are
imperfect and drift against each other, so the nominal value returned
by glXGetMscRateOML() is basically always a bit wrong or inaccurate
with respect to system time as used by userspace applications. Our
app therefore determines the "real" refresh duration with a
calibration loop of multiple seconds at startup. This works OK, but it
increases startup time, it can't account for slow clock drift over the
course of a session (I can't recalibrate mid-session), and the
calibration is imperfect due to the timing noise (preemption,
scheduling jitter, wakeup latency after swapbuffers, etc.) that
affects any measurement loop in userspace.
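For reference, the nominal value that glXGetMscRateOML() hands out is
nothing more than a function of the mode timing; a sketch, assuming
the field meanings of struct drm_display_mode (clock in kHz, htotal,
vtotal):

#include <linux/math64.h>

/* Nominal duration of one refresh in nanoseconds, derived purely from
   the mode timing: frame time = htotal * vtotal / dotclock. */
static u64 nominal_refresh_ns(u32 clock_khz, u32 htotal, u32 vtotal)
{
    /* htotal * vtotal pixels per frame, clock_khz * 1000 pixels per
       second, so ns = pixels * 1e6 / clock_khz */
    return div_u64((u64)htotal * vtotal * 1000000ULL, clock_khz);
}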
A better approach would be for Linux to measure the current video
refresh interval over a certain time window, e.g., computing a moving
average over a few seconds. This could be done if the vblank
timestamps were logged into a ringbuffer. The ringbuffer would allow
lock-free readout of the most recent vblank timestamp from
drm_vblank_count(). At the same time, the system could look at all
samples in the ringbuffer to compute the real duration of a video
refresh interval as an average over the deltas between samples, and
provide an accurate, current estimate for glXGetMscRateOML() that
would be better than anything we can do in userspace.
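A rough sketch of what I mean; the names and the ring size are
placeholders, and synchronization with readers (e.g., via the seqlock
above) is left out:

#define VBLANK_RING_SIZE 128    /* a few seconds of samples at 60 Hz */

struct vblank_ring {
    u64 ts_ns[VBLANK_RING_SIZE];
    unsigned int head;      /* index of most recent sample */
    unsigned int count;     /* number of valid samples */
};

/* Writer side: one call per vblank from the irq handler. */
static void vblank_ring_add(struct vblank_ring *r, u64 now_ns)
{
    r->head = (r->head + 1) % VBLANK_RING_SIZE;
    r->ts_ns[r->head] = now_ns;
    if (r->count < VBLANK_RING_SIZE)
        r->count++;
}

/* Measured refresh duration: average delta between the oldest and the
   newest sample. Returns 0 if there aren't enough samples yet, so the
   caller can fall back to the nominal mode-derived value. */
static u64 vblank_ring_refresh_ns(const struct vblank_ring *r)
{
    unsigned int oldest;

    if (r->count < 2)
        return 0;

    oldest = (r->head + VBLANK_RING_SIZE - (r->count - 1)) % VBLANK_RING_SIZE;
    return div_u64(r->ts_ns[r->head] - r->ts_ns[oldest], r->count - 1);
}

vblank_ring_refresh_ns() would then back an improved
glXGetMscRateOML(), with the nominal mode-derived value as a fallback
until the ring has filled.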
The second problem is how to reinitialize the current vblank
timestamp in drm_update_vblank_count() when vblank interrupts get
reenabled after they've been disabled for a long period of time.
One generic way to reinitialize would be to calculate the time elapsed
since the last known vblank timestamp from the computed vblank count
"diff", by multiplying that count by the known duration of the video
refresh interval. In that case an accurate estimate of
glXGetMscRateOML() would be important, so a ringbuffer with samples
would probably help.
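In code this would amount to little more than an extrapolation in
drm_update_vblank_count(); 'diff' is the vblank count delta the
function already computes, the other names come from the hypothetical
sketches above:

/* Extrapolate the last known vblank timestamp forward over the
   vblanks that were missed while interrupts were off. */
u64 refresh_ns = vblank_ring_refresh_ns(ring);

if (refresh_ns == 0)    /* not enough samples yet */
    refresh_ns = nominal_refresh_ns(mode->clock, mode->htotal, mode->vtotal);

last_vblank_ts_ns += (u64)diff * refresh_ns;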
Here is another proposal that I would love to see in Linux:
A good method to find the accurate time of the last vblank
immediately, irrespective of possible delays in irq dispatching, is
to use the current scanout position of the crtc as a high resolution
clock that counts the time since the start of the vertical blank
interval. AFAIK basically all GPUs have a register that allows
reading out the currently scanned out scanline. If the exact duration
of a video refresh interval 'refreshduration' is known by measurement,
the current 'scanline' is known from a register read, the total height
of the display 'vtotal' is known, and one has a timestamp of the
current system time 'tsystem' from do_gettimeofday(), one can
conceptually implement this pseudo-code:
scanline = dev->driver->getscanline(dev, crtc);
tsystem = do_gettimeofday(...);
tvblank = tsystem - refreshduration * (scanline / vtotal);
(One has to measure scanline relative to the first line inside the
vblank area though, and the math might be a bit more complex, due to
the lack of floating point arithmetic in the kernel?).
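In integer-only kernel arithmetic, and with 'scanline' already counted
from the first line of the vblank interval, the same idea might look
roughly like this; the getscanline hook and the variable names are
hypothetical:

struct timeval now;
u64 elapsed_ns, tvblank_ns;
int scanline = dev->driver->getscanline(dev, crtc);    /* proposed hook */

do_gettimeofday(&now);

/* Time elapsed since vblank onset: refresh_ns * scanline / vtotal,
   computed in 64-bit integer math to avoid floating point. */
elapsed_ns = div_u64((u64)refresh_ns * scanline, vtotal);
tvblank_ns = timeval_to_ns(&now) - elapsed_ns;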
This would allow quick reinitialization of the vblank timestamp in
drm_update_vblank_count() and would provide vblank timestamps from
inside drm_handle_vblank() which are robust against interrupt
dispatch latency. It would require a new callback into the
GPU specific driver, e.g., dev->driver->getscanline(dev, crtc). One
could fall back to a simple do_gettimeofday(&tvblank) for drivers
that don't implement the new callback. Maybe one could even add
a dev->driver->getvblanktime(dev, crtc); callback that executes the
above lines inside one function and optionally allows more
clever GPU specific strategies like GPU internal clocks and snapshot
registers.
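As member declarations in struct drm_driver, the proposal would look
something like this (the names are just suggestions):

/* Proposed additions to struct drm_driver: */

/* Return the scanline currently being scanned out on 'crtc'. */
int (*getscanline)(struct drm_device *dev, int crtc);

/* Optional combined hook: return the system time of the last vblank
   on 'crtc' directly, so drivers can use GPU internal clocks or
   snapshot registers if they have them. */
int (*getvblanktime)(struct drm_device *dev, int crtc,
                     struct timeval *tvblank);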
Our app uses this "beamposition timestamping" in userspace to get rid
of most of the scheduling/wakeup delay and jitter after swap
completion on Windows and MacOS/X; both systems provide a function to
query the current scanout position. It is so far the only method that
allows us to achieve sub-millisecond precision for our timestamps. It
also allows us to provide both a timestamp for vblank onset and a
timestamp for the start of scanout of a new frame. If Linux did this
within its timestamping code by default, it would be even more
accurate: our userspace code can get preempted at the wrong moment,
but kernel code inside the irq handler should be more robust. It would
also allow a more precise implementation of the UST timestamp
according to the OML_sync_control spec, as that spec asks for UST to
be the time of the start of scanout of the first scanline of a new
video frame, rather than the start of vblank.
Let me know what you think about this,
-mario
*********************************************************************
Mario Kleiner
Max Planck Institute for Biological Cybernetics
Spemannstr. 38
72076 Tuebingen
Germany
e-mail: mario.kleiner at tuebingen.mpg.de
office: +49 (0)7071/601-1623
fax: +49 (0)7071/601-616
www: http://www.kyb.tuebingen.mpg.de/~kleinerm
*********************************************************************
"For a successful technology, reality must take precedence
over public relations, for Nature cannot be fooled."
(Richard Feynman)