[PATCH 5/6] drm: Prevent vblank counter jumps with timestamp based update method.

Mario Kleiner mario.kleiner.de at gmail.com
Wed Feb 10 16:28:26 UTC 2016

On 02/09/2016 04:03 PM, Daniel Vetter wrote:
> On Tue, Feb 09, 2016 at 04:11:49PM +0200, Ville Syrjälä wrote:
>> On Tue, Feb 09, 2016 at 02:53:30PM +0100, Mario Kleiner wrote:
>>> On 02/09/2016 11:09 AM, Daniel Vetter wrote:
>>>> On Mon, Feb 08, 2016 at 02:13:28AM +0100, Mario Kleiner wrote:
>>>>> The changes to drm_update_vblank_count() in Linux 4.4 added a
>>>>> method to emulate a hardware vblank counter by use of high
>>>>> precision vblank timestamps if a kms driver supports those,
>>>>> but doesn't suppport hw vblank counters.
>>>>> That method assumes that the old timestamp from a previous
>>>>> invocation is valid, but that is not always the case. E.g.,
>>>>> if drm_reset_vblank_timestamp() gets called during drm_vblank_on()
>>>>> or drm_update_vblank_count() gets called outside vblank irq and
>>>>> the high precision timestamping can't deliver a precise timestamp,
>>>>> ie. drm_get_last_vbltimestamp() delivers a return value of false,
>>>>> then those functions will initialize the old timestamp to zero
>>>>> to mark it as invalid.
>>>>> A following call to drm_update_vblank_count() would then calculate
>>>>> elapsed time with vblank irqs off as current vblank timestamp minus
>>>>> the zero old timestamp and compute a software vblank counter increment
>>>>> that corresponds to system uptime, causing a large forward jump of the
>>>>> software vblank counter. That jump in turn can cause too long waits
>>>>> in drmWaitVblank and very long delays in delivery of vblank events,
>>>>> resulting in hangs of userspace clients.
>>>>> This problem can be observed on nouveau-kms during machine
>>>>> suspend->resume cycles, where drm_vblank_off is called during
>>>>> suspend, drm_vblank_on is called during resume and the first
>>>>> queries to drm_get_last_vbltimestamp() don't deliver high
>>>>> precision timestamps, resulting in a large harmful counter jump.
>>>> Why does nouveau enable vblank interrupts before it can get valid
>>>> timestamps? That sounds like the core bug here, and papering over that in
>>>> the vblank code feels very wrong to me. Maybe we should instead just not
>>>> sample the vblank at all if the timestamp fails and splat a big WARN_ON
>>>> about this, to give driver writers a chance to fix up their mess?
>>>> -Daniel
>>> The high precision timestamping is allowed to fail for a kms driver
>>> under some conditions which are not implementation errors of the driver,
>>> or getting disabled by user override, so we should handle that robustly.
>>> We handle it robustly everywhere else in the drm, so we should do it
>>> here as well.
>>> E.g., nouveau generally supports timestamping/scanout position queries,
>>> but can't support it on old pre NV-50 hardware with some output type
>>> (either on analog VGA, or DVI-D, can't remember atm. which one is
>>> unsupported).
>> I think the surprising thing here is that it fails first and then
>> succeeds on the same crtc/output combo presumably.
> Yeah exactly this. Failing consistently is ok imo (and probably should be
> handled). Failing first and then later on working (without changing the
> mode config in between) seems suspicous. That shouldn't ever happen really
> and seems like a driver bug (like enabling vblanks too early, before the
> pipe is fully up&running).
> -Daniel
>> I guess in theory the driver could fail during random times for whatever
>> reason, though I tend to think that the driver should either make it
>> robust or not even pretend to support it.
>> I suppose one failure mode the driver can't really do anything about is
>> some random SMI crap or something stalling the timestamp queries enough
>> that we fail the precisions tests and exhaust the retry limits. So yeah,
>> making it robust against that kind of nastyness sounds line it might be
>> a good idea. Though perhaps it should be something a bit more severe
>> than a DRM_DEBUG since I think it really shouldn't happen when the
>> driver and system are otherwise sane. Of course if it routinely fails
>> with some driver then simply making it spew errors into dmesg isn't
>> so nice, unless the driver also gets fixed.

I think i have an idea what might go wrong with nouveau, so i'll see if 
i can add a fixup patch.

There's another scenario where this zero-ts case can be hit. If the 
driver drm_vblank_init()'s - setting all timestamps to zero - and then 
code starts using vblanks (drm_vblank_get()) without drm_vblank_on 
beforehand, which is afaics the case with nouveau. Unless that's 
considered an error as well, we'd need to init the timestamps to 
something non-zero but harmless like 1 usecs at drm_vblank_init() time?
What makes sense as output here? DRM_WARN_ONCE?


>>> There are also new Soc drivers showing up which support those methods,
>>> but at least i didn't verify or test if their implementations are good
>>> enough for the needs of the new drm_udpate_vblank_count() which is more
>>> sensitive to kms drivers being even slightly off.
>>> -mario
>>>>> Fix this by checking if the old timestamp used for this calculations
>>>>> is zero == invalid. If so, perform a counter increment of +1 to
>>>>> prevent large counter jumps and reinitialize the timestamps to
>>>>> sane values.
>>>>> Signed-off-by: Mario Kleiner <mario.kleiner.de at gmail.com>
>>>>> Cc: <stable at vger.kernel.org> # 4.4+
>>>>> Cc: michel at daenzer.net
>>>>> Cc: vbabka at suse.cz
>>>>> Cc: ville.syrjala at linux.intel.com
>>>>> Cc: daniel.vetter at ffwll.ch
>>>>> Cc: dri-devel at lists.freedesktop.org
>>>>> Cc: alexander.deucher at amd.com
>>>>> Cc: christian.koenig at amd.com
>>>>> ---
>>>>>    drivers/gpu/drm/drm_irq.c | 7 +++++++
>>>>>    1 file changed, 7 insertions(+)
>>>>> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
>>>>> index fb17c45..88bdf19 100644
>>>>> --- a/drivers/gpu/drm/drm_irq.c
>>>>> +++ b/drivers/gpu/drm/drm_irq.c
>>>>> @@ -216,6 +216,13 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
>>>>>    			DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
>>>>>    				      " diff_ns = %lld, framedur_ns = %d)\n",
>>>>>    				      pipe, (long long) diff_ns, framedur_ns);
>>>>> +
>>>>> +		/* No valid t_old to calculate diff? Bump +1 to force reinit. */
>>>>> +		if (t_old->tv_sec == 0 && t_old->tv_usec == 0) {
>>>>> +			DRM_DEBUG_VBL("crtc %u: No baseline ts. Bump +1.\n",
>>>>> +				      pipe);
>>>>> +			diff = 1;
>>>>> +		}
>>>>>    	} else {
>>>>>    		/* some kind of default for drivers w/o accurate vbl timestamping */
>>>>>    		diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;
>>>>> --
>>>>> 1.9.1
>> --
>> Ville Syrjälä
>> Intel OTC

More information about the dri-devel mailing list