[Intel-gfx] [PATCH] drm: Aggressively disable vblanks

Wed Dec 22 13:06:35 PST 2010

On Dec 22, 2010, at 6:23 PM, Jesse Barnes wrote:

> On Wed, 22 Dec 2010 05:36:13 +0100
> Mario Kleiner <mario.kleiner at tuebingen.mpg.de> wrote:
>
>>> -------------------------------------------------------------------- 
>>> --
>>>
>>> Message: 1
>>> Date: Mon, 20 Dec 2010 19:23:40 -0800
>>> From: Keith Packard <keithp at keithp.com>
>>> Subject: Re: [Intel-gfx] [PATCH] drm: Aggressively disable vblanks
>>> To: Andy Lutomirski <luto at MIT.EDU>, Jesse Barnes
>>> 	<jbarnes at virtuousgeek.org>,	Chris Wilson <chris at chris- 
>>> wilson.co.uk>,
>>> 	David Airlie <airlied at linux.ie>
>>> Cc: intel-gfx at lists.freedesktop.org, dri-devel at lists.freedesktop.org
>>> Message-ID: <yunfwtrrepv.fsf at aiko.keithp.com>
>>> Content-Type: text/plain; charset="us-ascii"
>>>
>>> On Mon, 20 Dec 2010 14:00:54 -0500, Andy Lutomirski <luto at MIT.EDU>
>>> wrote:
>>>
>>>> Enabling and disabling the vblank interrupt (on devices that
>>>> support it) is cheap.  So disable it quickly after each
>>>> interrupt.
>>>
>>> So, the concern (and reason for the original design) was that you
>>> might
>>> lose count of vblank interrupts as vblanks are counted differently
>>> while
>>> off than while on. If vblank interrupts get enabled near the  
>>> interrupt
>>> time, is it possible that you'll end up off by one somehow?
>>>
>>> Leaving them enabled for 'a while' was supposed to reduce the
>>> impact of
>>> this potential error.
>>>
>>> -- 
>>> keith.packard at intel.com
>>> -------------- next part --------------
>>> A non-text attachment was scrubbed...
>>> Name: not available
>>> Type: application/pgp-signature
>>> Size: 189 bytes
>>> Desc: not available
>>> URL: <http://lists.freedesktop.org/archives/dri-devel/attachments/
>>> 20101220/105a9fb6/attachment-0001.pgp>
>>>
>>> ------------------------------
>>>
>>> Message: 2
>>> Date: Mon, 20 Dec 2010 22:34:12 -0500
>>> From: Andrew Lutomirski <luto at mit.edu>
>>> Subject: Re: [Intel-gfx] [PATCH] drm: Aggressively disable vblanks
>>> To: Keith Packard <keithp at keithp.com>
>>> Cc: intel-gfx at lists.freedesktop.org, dri-devel at lists.freedesktop.org
>>> Message-ID:
>>> 	<AANLkTik-1zni1DdS6i+1ARoenx6p=9jQAAiA6t5uGg_K at mail.gmail.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>> On Mon, Dec 20, 2010 at 10:23 PM, Keith Packard <keithp at keithp.com>
>>> wrote:
>>>> On Mon, 20 Dec 2010 14:00:54 -0500, Andy Lutomirski <luto at MIT.EDU>
>>>> wrote:
>>>>
>>>>> Enabling and disabling the vblank interrupt (on devices that
>>>>> support it) is cheap. ?So disable it quickly after each
>>>>> interrupt.
>>>>
>>>> So, the concern (and reason for the original design) was that you
>>>> might
>>>> lose count of vblank interrupts as vblanks are counted differently
>>>> while
>>>> off than while on. If vblank interrupts get enabled near the
>>>> interrupt
>>>> time, is it possible that you'll end up off by one somehow?
>>>
>>> So the race is:
>>>
>>> 1. Vblank happens.
>>> 2. vblank_get runs, reads hardware counter, and enables the  
>>> interrupt
>>> (and maybe not quite in that order)
>>> 3. Interrupt fires and increments the counter.  Now the counter is
>>> one too high.
>>>
>>> What if we read the hardware counter from the IRQ the first time  
>>> that
>>> it fires after being enabled?  That way, if the hardware and  
>>> software
>>> counters match (mod 2^23 or whatever the magic number is), we don't
>>> increment the counter.
>>>
>>>>
>>>> Leaving them enabled for 'a while' was supposed to reduce the
>>>> impact of
>>>> this potential error.
>>>>
>>>
>>> Fair enough,
>>>
>>> But five seconds is a *long* time, and anything short enough that  
>>> the
>>> interrupt actually gets turned off in normal use risks the same  
>>> race.
>>>
>>> --Andy
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 3
>>> Date: Mon, 20 Dec 2010 19:47:11 -0800
>>> From: Keith Packard <keithp at keithp.com>
>>> Subject: Re: [Intel-gfx] [PATCH] drm: Aggressively disable vblanks
>>> To: Andrew Lutomirski <luto at mit.edu>
>>> Cc: intel-gfx at lists.freedesktop.org, dri-devel at lists.freedesktop.org
>>> Message-ID: <yunbp4frdmo.fsf at aiko.keithp.com>
>>> Content-Type: text/plain; charset="us-ascii"
>>>
>>> On Mon, 20 Dec 2010 22:34:12 -0500, Andrew Lutomirski
>>> <luto at mit.edu> wrote:
>>>
>>>> But five seconds is a *long* time, and anything short enough  
>>>> that the
>>>> interrupt actually gets turned off in normal use risks the same  
>>>> race.
>>>
>>> Right, so eliminating any race seems like the basic requirement. Can
>>> that be done?
>>>
>>> -- 
>>> keith.packard at intel.com
>>> -------------- next part --------------
>>> A non-text attachment was scrubbed...
>>> Name: not available
>>> Type: application/pgp-signature
>>> Size: 189 bytes
>>> Desc: not available
>>> URL: <http://lists.freedesktop.org/archives/dri-devel/attachments/
>>> 20101220/5ca3b674/attachment-0001.pgp>
>>>
>>> ------------------------------
>>>
>>> Message: 4
>>> Date: Mon, 20 Dec 2010 22:55:46 -0500
>>> From: Andrew Lutomirski <luto at mit.edu>
>>> Subject: Re: [Intel-gfx] [PATCH] drm: Aggressively disable vblanks
>>> To: Keith Packard <keithp at keithp.com>
>>> Cc: intel-gfx at lists.freedesktop.org, dri-devel at lists.freedesktop.org
>>> Message-ID:
>>> 	<AANLkTimk6RLkr8Lt76cR8ncLW_3kaX6Dqa+=id9_G-8C at mail.gmail.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>> On Mon, Dec 20, 2010 at 10:47 PM, Keith Packard <keithp at keithp.com>
>>> wrote:
>>>> On Mon, 20 Dec 2010 22:34:12 -0500, Andrew Lutomirski
>>>> <luto at mit.edu> wrote:
>>>>
>>>>> But five seconds is a *long* time, and anything short enough that
>>>>> the
>>>>> interrupt actually gets turned off in normal use risks the same
>>>>> race.
>>>>
>>>> Right, so eliminating any race seems like the basic requirement.  
>>>> Can
>>>> that be done?
>>>
>>> I'll give it a shot.
>>>
>>> Do you know if there's an existing tool to call drm_wait_vblank from
>>> userspace for testing?  I know approximately nothing about libdrm or
>>> any userspace graphics stuff whatsoever.
>>>
>>> --Andy
>>>
>>>>
>>>> --
>>>> keith.packard at intel.com
>>>>
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 5
>>> Date: Mon, 20 Dec 2010 20:03:53 -0800
>>> From: Jesse Barnes <jbarnes at virtuousgeek.org>
>>> Subject: Re: [Intel-gfx] [PATCH] drm: Aggressively disable vblanks
>>> To: Andrew Lutomirski <luto at mit.edu>
>>> Cc: intel-gfx at lists.freedesktop.org, dri-devel at lists.freedesktop.org
>>> Message-ID: <20101220200353.12479178 at jbarnes-desktop>
>>> Content-Type: text/plain; charset=US-ASCII
>>>
>>> On Mon, 20 Dec 2010 22:55:46 -0500
>>> Andrew Lutomirski <luto at mit.edu> wrote:
>>>
>>>> On Mon, Dec 20, 2010 at 10:47 PM, Keith Packard
>>>> <keithp at keithp.com> wrote:
>>>>> On Mon, 20 Dec 2010 22:34:12 -0500, Andrew Lutomirski
>>>>> <luto at mit.edu> wrote:
>>>>>
>>>>>> But five seconds is a *long* time, and anything short enough
>>>>>> that the
>>>>>> interrupt actually gets turned off in normal use risks the same
>>>>>> race.
>>>>>
>>>>> Right, so eliminating any race seems like the basic  
>>>>> requirement. Can
>>>>> that be done?
>>>>
>>>> I'll give it a shot.
>>>>
>>>> Do you know if there's an existing tool to call drm_wait_vblank  
>>>> from
>>>> userspace for testing?  I know approximately nothing about  
>>>> libdrm or
>>>> any userspace graphics stuff whatsoever.
>>>
>>> Yeah, libdrm has a test program called vbltest; it should do what  
>>> you
>>> need.
>>>
>>
>> Not so fast please! After a lot of review by Jesse, Dave and Chris
>> just merged a set of my patches into the drm-next (and the intel and
>> radeon kms drivers) to implement precise timestamping of vblank's and
>> pageflip completion and vblank counting for DRI2 and the
>> OML_sync_control extension. It also fixes (hopefully) almost all  
>> race-
>> conditions (that i could find or think of) related to vblank irq on/
>> off, a few of them surprising and due to "funny" behaviour of some
>> gpu's when you enable/disable vblanks (e.g., radeon's spontaneously
>> firing a vblank irq in the middle of a scanout cycle when vblank
>> irq's get enabled, or firing the irq sometimes shortly before a
>> vblank instead of in the vblank).
>>
>> There's one tiny race left in the vblank off path, which i wanted to
>> address during the next weeks. Also i need to implement support for
>> nouveau. After that we could simply reduce the vblank off timeout to
>> something small like 50 msecs. Or use Andrew's heuristic on top of  
>> this.
>>
>> In any case, please check against the drm-next branch. I think your
>> patches touch/conflict with most of the areas in drm that are
>> modified in drm-next. At least my users need a very high level of
>> precision and robustness in vblank counting and timestamping for
>> neuro-science applications and similar stuff.
>
> To preserve all of that, the easiest way to reduce the amount of time
> the vblank remains enabled would probably be to reduce the vblank off
> timeout.  Just make it 100ms or so since that will make sure it stays
> on for a few frames at least.
>
> Mario, for cases where you have very intermittent waits, I think you
> added a method to disable the timer altogether?  That combined with a
> reduced timeout seems like it could make everyone happy...
>
> -- 
> Jesse Barnes, Intel Open Source Technology Center

There's a new drm module parameter for selecting the timeout: echo 50  
 > /sys/module/drm/parameters/vblankoffdelay
would set the timeout to 50 msecs. A setting of zero will disable the  
timer, so vblank irq's would stay on all the time.

The default setting is still 5000 msecs as before, but reducing this  
to 100 msecs wouldn't be a real problem imho. At least i didn't  
observe any miscounting during extensive testing with 100 msecs.

The patches in drm-next fix a couple of races that i observed on  
intel and radeon during testing and a few that i didn't see but that  
i could imagine happening. It tries to make sure that the saved final  
count at vblank irq disable of the software vblank_count and the gpu  
counter are consistent - no off by one errors. They also try to  
detect and filter out spurious vblank interrupts at vblank enable  
time, e.g., on the radeon.

There's still one possible race in the disable path which i will try  
to fix: We don't know when exactly the hardware counter increments  
wrt. the processing of the vblank interrupt - it could increment a  
few (dozen/hundred) microseconds before or after the irq handler  
runs, so if you happen to query the hardware counter while the gpu is  
inside the vblank you can't be sure if you picked up the old count or  
the new count for that vblank.

This only matters during vblank disable. For that reason it's not  
such a good idea to disable vblank irq's from within the vblank irq  
handler. I tried that and it didn't work well --> When doing it from  
within irq you basically maximize the chance of hitting the time  
window when the race can happen. Delaying within the irq handler by a  
millisecond would fix that, but that's not what we want.

Having the disable in a function triggered by a timer like now is the  
most simple solution i could come up with. There we can burn a few  
dozen microseconds if neccessary to remove this race.

There could be other races that i couldn't think of and that i didn't  
see during my testing with my 2 radeons and 1 intel gpu. Therefore i  
think we should keep vblank irq's enabled for at least 2 or maybe 3  
refresh cycles if they aren't used by anyone. Apps that want to  
schedule swaps very precisely and the ddx/drm/kms-driver itself do  
multiple queries in quick succession for a typical swapbuffers call,  
i.e., drm_vblank_get() -> query -> drm_vblank_put(), so on an  
otherwise idle graphics system the refcount will toggle between zero  
and one multiple times during a swap, usually within a few  
milliseconds. If the timeout is big enough so that irq's don't get  
disabled within such a sequence of toggles, even if there's a bit of  
scheduling delay for the x-server or client, then a client will see  
at least consistent vblank count during a swap, even if there are  
still some races hiding somewhere that we don't handle properly. That  
should be good enough, and paranoid clients can always increase the  
timeout value or set it to infinite.

E.g., my toolkit schedules a swap for a specific system time like this:

1. glXGetSyncValuesOML(... &base_msc, &base_ust);
2. calculate target_msc based on user provided swap deadline t and  
(base_msc, base_ust) as a baseline.
3. glXSwapBuffersMscOML(...., target_msc,...);
4. glXWaitForSbcOML() or use Intel_swap_events for retrieving the  
true msc and ust of swap completion.

=> Doesn't matter if there would be an off-by-one error in vblank  
counting due to an unknown race, as long as it doesn't happen between  
1. and 4. As long as there aren't any client/x-server scheduling  
delays between step 1 and 3 of more than /sys/module/drm/parameters/ 
vblankoffdelay msecs, nothing can go wrong even if there are race  
conditions left in that area.

=> 50-100 msecs as new default would be probably good enough and at  
the same time prevent the "blinking cursor keeps vblank irq's on all  
the time" problem.

I didn't reduce the timeout in the current patches because the  
filtering for race-conditions and other gpu funkyness relies on  
somewhat precise vblank timestamps and the timestamping hooks aren't  
yet implemented in the nouveau kms. Maybe i manage to get this  
working over christmas. Patches to nouveau would be simple, i just  
don't know the mmio register addresses for crtc scanout position on  
nvidia gpu's.

-mario

*********************************************************************
Mario Kleiner
Max Planck Institute for Biological Cybernetics
Spemannstr. 38
72076 Tuebingen
Germany

e-mail: mario.kleiner at tuebingen.mpg.de
office: +49 (0)7071/601-1623
fax:    +49 (0)7071/601-616
www:    http://www.kyb.tuebingen.mpg.de/~kleinerm
*********************************************************************
"For a successful technology, reality must take precedence
over public relations, for Nature cannot be fooled."
(Richard Feynman)