[RFC] Reduce idle vblank wakeups

Mario Kleiner mario.kleiner at tuebingen.mpg.de
Thu Nov 17 12:36:23 PST 2011


On Nov 17, 2011, at 3:19 AM, Matthew Garrett wrote:

> On Thu, Nov 17, 2011 at 01:26:37AM +0100, Mario Kleiner wrote:
>> On Nov 16, 2011, at 7:48 PM, Matthew Garrett wrote:
>>> I'll admit that I'm struggling to understand the issue here. If the
>>> vblank counter is incremented at the time of vblank (which isn't the
>>> case for radeon, it seems, but as far as I can tell is the case for
>>> Intel) then how does ping-ponging the IRQ matter?
>>> vblank_disable_and_save() appears to handle this case.
>>>
>>
>> The drm software vblank counter which is used for scheduling swaps
>> etc. gets incremented in the gpu's vblank irq handler. The gpu's
>> hardware counter gets incremented somewhere inside the vblank
>> interval. The increments don't happen at the same point in time. The
>> race is between the vblank on/off code, which gets scheduled either
>> by the timeout timer for the "off" case, or by usercode for the "on"
>> case and the gpu's hardware vblank counter.
>
> Yes, that makes sense given the current design.
>
>> The existing code uses a lock (vblank_time_lock) to prevent some
>> races between itself and the vblank irq handler, but it can't "lock"
>> the gpu, so if the enable/disable code executes while the gpu is
>> scanning out the vblank interval, it is undefined if the final
>> vblank count will be correct or off by one. Vblank duration is
>> somewhere up to 4.5% of refresh interval duration, so there's your
>> up to 4% chance of getting it wrong.
>
> Well presumably by "undefined" we really mean "hardware-specific" -  
> for
> any given well-functioning hardware I'd expect the answer to be
> well-defined. Like I said, with Radeon we have the option of  
> triggering
> an interrupt at the point where the hardware counter is actually
> incremented. If that then shuts down the vblank interrupt then we  
> should
> have a well-defined answer.
>

Yes, "hardware-specific". It is only "undefined" from the point of  
view of the drm core in the sense that the code in the core is  
currently agnostic to what the underlying gpu/driver does.

...

>>
>> Use the crtc scanout position queries together with the vblank
>> counter queries inside some calibration loop, maybe executed after
>> each modeset, to find out the scanline range in which the hardware
>> vblank counter increments -- basically a forbidden range of scanline
>> positions where the race would happen. Then at each vblank off/on,
>> query scanout position before and after the hw vblank counter query.
>> If according to the scanout positions the vblank counter query
>> happened within the forbidden time window, retry the query. With a
>> well working calibration that should add no delay in most cases and
>> a delay to the on/off code of a few dozen microseconds (=duration of
>> a few scanlines) worst case.
>
> Assuming we're sleeping rather than busy-looping, that's certainly ok.
> My previous experiments with radeon indicated that the scanout irq was
> certainly not entirely reliable - on the other hand, I was trying  
> to use
> it for completing memory reclocking within the vblank interval. It was
> typically still within a few scanlines, so a sanity check there  
> wouldn't
> pose too much of a problem.
>

Sleeping in the timer-triggered off path would be ok, but in the
on path we need to be relatively fast, so I think some kind of CPU-
friendly busy waiting (cpu_relax() or something like that, if I
understand its purpose?) probably would be necessary.
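To make the calibrate-and-retry idea above concrete, here is a minimal
userspace-style sketch. The forbidden scanline window [fb_lo, fb_hi],
inside which the hardware counter increments, would come from a
calibration loop run after each modeset; the function pointers stand in
for driver-specific scanout-position and hardware-counter queries. All
names here are illustrative assumptions, not real DRM API.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a scanline lies inside the calibrated "forbidden" window
 * in which the hardware vblank counter increments. */
static bool in_forbidden_range(int scanline, int fb_lo, int fb_hi)
{
    return scanline >= fb_lo && scanline <= fb_hi;
}

/* Re-read the counter until neither the scanout position sampled
 * before nor after the read fell inside the forbidden window, so the
 * result cannot be off by one. In the kernel the retry loop would
 * spin with cpu_relax(); worst case it waits a few scanlines. */
static uint32_t query_vblank_counter_safe(int (*get_scanout_pos)(void),
                                          uint32_t (*read_hw_counter)(void),
                                          int fb_lo, int fb_hi)
{
    int before, after;
    uint32_t count;

    do {
        before = get_scanout_pos();
        count  = read_hw_counter();
        after  = get_scanout_pos();
    } while (in_forbidden_range(before, fb_lo, fb_hi) ||
             in_forbidden_range(after, fb_lo, fb_hi));

    return count;
}
```

With a well-working calibration the loop body runs exactly once in the
common case, matching the "no delay in most cases" claim above.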


>>> I've no problem with all of this work being case by case.
>>
>> Me neither. I just say if you'd go to the extreme and disable vblank
>> irq's immediately with zero delay, without reliably fixing the
>> problem i mentioned, you'd get those off by one counts, which would
>> be very bad for apps that rely on precise timing. Doing so would
>> basically make the whole oml_sync_control implementation mostly
>> useless.
>
> Right. My testing of sandybridge suggests that there wasn't a problem
> here - even with the ping-ponging I was reliably getting 60 interrupts
> in 60 seconds, with the counter incrementing by 1 each time. I  
> certainly
> wouldn't push to enable it elsewhere without making sure that the
> results are reliable.
>

How hard did you test it? Given that the off-by-one is a race
condition, there is a good chance you miss the bad cases due to lucky
timing. I think one way to test the ping-pong case might be a tight
loop with calls to glXGetSyncValuesOML(), or at a lower level a tight
loop around the drmWaitVblank() ioctl, which is what
glXGetSyncValuesOML() does.

That would be a sequence of drm_vblank_get() -> irq's on, reinitialize
vblank counter and timestamps -> fetch values -> drm_vblank_put() ->
irq's off, etc. If you run this with a couple of thousand calls per
second there would be some coverage of the vblank period. Either that,
or some funkiness, because we wouldn't exercise the vblank irq anymore
for vblank counting and might miss something there for some reason...

>>> If vblanks are disabled and then re-enabled between 1 and 3,
>>> what's the
>>> negative outcome?
>>
>> 1. glXGetSyncValuesOML() gives you the current vblank count and a
>> timestamp of when exactly scanout of the corresponding video frame
>> started. These values are the baseline for any vblank based swap
>> scheduling or any other vblank synchronized actions, e.g., via
>> glXWaitForMscOML().
>>
>> 2. App calculates stuff based on 1. but gets preempted or
>> experiences scheduling delay on return from 1. - or the x-server
>> gets preempted or scheduled away. Whatever, time passes...
>>
>> 3. glXSwapBuffersMscOML() or glXWaitForMscOML() is used to schedule
>> something to occur at a target vblank count, based on the results of
>> 1.
>>
>> As long as vblanks don't get disabled between 1 and 3, everything is
>> consistent. Scheduling delays/preemption of more than a few (dozen)
>> milliseconds worst case are very rare, so a lowered vblank off delay
>> of even one or two refresh cycles would solve the problem. I assume
>> that most toolkits with needs for timing precision would follow a
>> three step strategy like this one.
>>
>> If vblank irq's get disabled immediately after step 1. and reenabled
>> at step 3., and this triggers an off by one error due to a race,
>> then the calculated target vblank count from steps 1/2 is invalid
>> and useless for step 3 and many vision scientists which just started
>> to make friendship with Linux lately will make very sad faces. Due
>> to the nature of many of the stimulation paradigms used, there is
>> also a high chance that many of the enables and disables would
>> happen inside or very close to the problematic vblank interval when
>> off by one error is most likely.
>
> Ok, so as long as we believe that we're reliably reading the hardware
> counter and not getting off by one, we should be ok? I just want to be
> clear that this is dependent on hardware behaviour rather than being
> absolutely inherent :)
>

Yes. The way we find the final/new software/hardware vblank count at
irq off/on time must be consistent with what would have happened if
the counters had been incremented by "just another vblank irq".

But apart from this, I would assume that even if we can remove all
races, it probably would still make sense to always have some small
vblank off delay, even if it is only half a video refresh cycle?
During execution of a bufferswap or a three-step procedure like my
toolkit does, or a desktop composition pass (at least the way compiz
does it, didn't check other compositors), we will usually have
multiple drm_vblank_get() -> drm_vblank_put() invocations in quick
succession, triggered by the ddx inside the x-server and the drm code
during swap preparation and swap completion. Without a delay of at
least a few milliseconds the code would turn the irq's on and off
multiple times within a few milliseconds, each time involving
locking, transactions with the gpu, some memory barriers, etc. As
long as the desktop is busy with some OpenGL animations, vblank irq's
will necessarily fire for each refresh cycle regardless of what the
timeout is. And for a small timeout of a few milliseconds, when the
desktop finally goes idle, you'd probably end up with at most one
extra vblank irq, but save a little bit of cpu time by avoiding
multiple on/off transitions for each frame during the busy period.
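The debouncing behavior argued for above can be modeled with a toy
state machine: drm_vblank_put() does not shut the irq off immediately,
it only arms a timeout, and a following drm_vblank_get() within the
timeout cancels it, so rapid get/put pairs during a swap never
ping-pong the hardware. The names and the flag standing in for the
timeout timer are illustrative, not the real DRM internals.

```c
#include <stdbool.h>

struct vbl_state {
    int  refcount;          /* users who need vblank irqs on */
    bool irq_on;            /* modeled hardware irq state */
    bool disable_pending;   /* stands in for the timeout timer */
};

/* First user turns the irq on and cancels any pending timeout. */
static void vbl_get(struct vbl_state *s)
{
    if (s->refcount++ == 0) {
        s->disable_pending = false;
        s->irq_on = true;
    }
}

/* Last user only arms the timeout; the irq stays on for now. */
static void vbl_put(struct vbl_state *s)
{
    if (--s->refcount == 0)
        s->disable_pending = true;
}

/* Timeout fires: disable only if nobody re-acquired in the meantime. */
static void vbl_timeout(struct vbl_state *s)
{
    if (s->refcount == 0 && s->disable_pending) {
        s->irq_on = false;
        s->disable_pending = false;
    }
}
```

Several get/put pairs within one frame then cost nothing extra, and an
idle desktop pays at most one trailing vblank irq before the timeout
disables the interrupt.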

Out of pure interest, how much power does vblank off actually save,  
compared to waking up the cpu every 16 msecs?

>> I agree with your patches, I just don't want vblank irq's to be
>> disabled immediately with a zero timeout before the remaining races
>> are fixed; that would surely break things at least for scientific
>> applications. And I'd like to keep some optional parameter that
>> allows desperate users to disable the vblank off mechanism in case
>> they encounter a bug.
>
> Yeah, no problem. I understand that completely.
>

Great, thanks.
-mario

> -- 
> Matthew Garrett | mjg59 at srcf.ucam.org

*********************************************************************
Mario Kleiner
Max Planck Institute for Biological Cybernetics
Spemannstr. 38
72076 Tuebingen
Germany

e-mail: mario.kleiner at tuebingen.mpg.de
office: +49 (0)7071/601-1623
fax:    +49 (0)7071/601-616
www:    http://www.kyb.tuebingen.mpg.de/~kleinerm
*********************************************************************
"For a successful technology, reality must take precedence
over public relations, for Nature cannot be fooled."
(Richard Feynman)


