[Bug 38800] glXSwapBuffersMscOML is slow on AMD Fusion but not on Intel 945 w/Atom

Wed Jul 6 18:02:28 PDT 2011

https://bugs.freedesktop.org/show_bug.cgi?id=38800

--- Comment #26 from Mario Kleiner <mario.kleiner at tuebingen.mpg.de> 2011-07-06 18:02:26 PDT ---
Ok, i looked at the patch. The way it is now it won't work reliably and is
vulnerable to the races we tried to remove in the current implementation.

First, as far as i understood, the pageflip ioctl() is supposed to be
asynchronous, non-blocking and fast:
<http://www.mail-archive.com/dri-devel@lists.sourceforge.net/msg39857.html>

With the patch, the ioctl() doesn't block until pageflip completion, but it
does add approx. 0.7 msecs for waiting until outside of vblank whenever a flip
is scheduled more than 1 frame into the future, because then the ioctl() is
usually called within vblank. It blocks in radeon_bo_wait() until all pending
rendering has completed. If you have a large amount of rendering to do, or
multiple clients rendering to multiple displays, i'd assume the blocking time
could be substantial. It would block the x-server's main loop for that time,
not just the actual client app which submits the rendering work. I don't know
how much real world sluggishness this adds to the GUI, compared to other
current sources of lag, but it doesn't sound like such a good idea to me to
block the x-server there to avoid a little bit of complexity if we don't need
to block and will have more displays in the future, esp. on 6-display radeon
gpu's. If you think about having swap group support for synchronized swaps
across different drawables or similar features in the future, then it would
also be better to stay away from blocking in the ioctl() itself.

If we move the pageflip programming from the vblank irq handler to the fence
irq handler we could avoid blocking in the ioctl, with basically the same
complexity as now.

The bigger problem for me is that the current patch doesn't wait reliably until
the flip has completed before it sends the pageflip completion event. Getting
accurate vblank timestamps is not the problem. That is already solved by the
drm_vblank_count_and_time() function. It returns a timestamp which corresponds
to start of scanout at the end of the most recent/current vblank, just as the
OML_sync_control spec requires. I tested r500, r600 and a HD-5000 eyefinity
4-display setup with special measurement equipment and the timestamps are
almost microsecond accurate, compensating for any irq delays.

The problem is making sure that we emit the pageflip completion event during
the video frame in which the pageflip completed, not one or multiple refresh
cycles later or earlier. Otherwise the timestamps are meaningless, even
dangerous for applications like mine that really must rely on them, and the
whole synchronization of rendering to bufferswaps and throttling is broken and
we'll have beautiful graphics glitches.

The proposed patch schedules radeon_crtc->unpin_work for the vblank irq
handler, then blocks on radeon_bo_wait() for an unknown amount of time, then
programs the flip. During this time a vblank irq could already do a too early
unpin and emission of a wrong pageflip completion event and timestamp.

The "wait until outside vblank" is not reliable. E.g., assume the ioctl() is
called a few microseconds before the vblank, detects it is not in vblank,
therefore goes ahead, gets delayed for some reason (interrupt, preemption by
higher prio thread, whatever) while the gpu enters the vblank, and then
programs the flip just inside the vblank -> boom.

As i had to learn, vbank irq's on radeon's can also happen a few scanlines
before the start of vblank, so it can happen that the check for pageflip
completion would run before the flip happens, detect that no flip happened, and
then the flip would happen a few microseconds later and go unnoticed for a
refresh cycle and we'd be back to 30 fps on a 60 hz display.

I'm not saying the proposed patch couldn't be fixed to handle this correctly,
but it would require a bit of extra locking and communication between the
ioctl() in process context and the vblank irq handler, and some protection of
the ioctl() against preemption, to avoid stalling the irq handler on some
shared locks etc. I'm not sure if the correctly behaving end result wouldn't be
as "complex" as the current implementation. The current method is simple in the
sense that all timing sensitive stuff happens sequentially in a single "irq
thread" - the gpu irq handler can't race against itself. There can be races
between the gpu and the cpu, which i think we took care of, but it avoids
possible additional races between scheduling code in process context and irq
context. It also tries to make sure that it emits the pageflip completion as
soon as possible and doesn't delay it by a frame just because of an unlucky
order of execution.

My proposal and Simons patch are very similar in spirit, just that i'd program
the flip from within the fence interrupt handler to avoid potential races
between ioctl and irq and to avoid blocking in the ioctl().

-- 
Configure bugmail: https://bugs.freedesktop.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.