[PATCH 5/6] drm: Prevent vblank counter jumps with timestamp based update method.
Mario Kleiner
mario.kleiner.de at gmail.com
Wed Feb 10 16:28:26 UTC 2016
On 02/09/2016 04:03 PM, Daniel Vetter wrote:
> On Tue, Feb 09, 2016 at 04:11:49PM +0200, Ville Syrjälä wrote:
>> On Tue, Feb 09, 2016 at 02:53:30PM +0100, Mario Kleiner wrote:
>>> On 02/09/2016 11:09 AM, Daniel Vetter wrote:
>>>> On Mon, Feb 08, 2016 at 02:13:28AM +0100, Mario Kleiner wrote:
>>>>> The changes to drm_update_vblank_count() in Linux 4.4 added a
>>>>> method to emulate a hardware vblank counter by use of high
>>>>> precision vblank timestamps if a kms driver supports those,
>>>>> but doesn't suppport hw vblank counters.
>>>>>
>>>>> That method assumes that the old timestamp from a previous
>>>>> invocation is valid, but that is not always the case. E.g.,
>>>>> if drm_reset_vblank_timestamp() gets called during drm_vblank_on()
>>>>> or drm_update_vblank_count() gets called outside vblank irq and
>>>>> the high precision timestamping can't deliver a precise timestamp,
>>>>> ie. drm_get_last_vbltimestamp() delivers a return value of false,
>>>>> then those functions will initialize the old timestamp to zero
>>>>> to mark it as invalid.
>>>>>
>>>>> A following call to drm_update_vblank_count() would then calculate
>>>>> elapsed time with vblank irqs off as current vblank timestamp minus
>>>>> the zero old timestamp and compute a software vblank counter increment
>>>>> that corresponds to system uptime, causing a large forward jump of the
>>>>> software vblank counter. That jump in turn can cause too long waits
>>>>> in drmWaitVblank and very long delays in delivery of vblank events,
>>>>> resulting in hangs of userspace clients.
>>>>>
>>>>> This problem can be observed on nouveau-kms during machine
>>>>> suspend->resume cycles, where drm_vblank_off is called during
>>>>> suspend, drm_vblank_on is called during resume and the first
>>>>> queries to drm_get_last_vbltimestamp() don't deliver high
>>>>> precision timestamps, resulting in a large harmful counter jump.
>>>>
>>>> Why does nouveau enable vblank interrupts before it can get valid
>>>> timestamps? That sounds like the core bug here, and papering over that in
>>>> the vblank code feels very wrong to me. Maybe we should instead just not
>>>> sample the vblank at all if the timestamp fails and splat a big WARN_ON
>>>> about this, to give driver writers a chance to fix up their mess?
>>>> -Daniel
>>>>
>>>
>>> The high precision timestamping is allowed to fail for a kms driver
>>> under some conditions which are not implementation errors of the driver,
>>> or getting disabled by user override, so we should handle that robustly.
>>> We handle it robustly everywhere else in the drm, so we should do it
>>> here as well.
>>>
>>> E.g., nouveau generally supports timestamping/scanout position queries,
>>> but can't support it on old pre NV-50 hardware with some output type
>>> (either on analog VGA, or DVI-D, can't remember atm. which one is
>>> unsupported).
>>
>> I think the surprising thing here is that it fails first and then
>> succeeds on the same crtc/output combo presumably.
>
> Yeah exactly this. Failing consistently is ok imo (and probably should be
> handled). Failing first and then later on working (without changing the
> mode config in between) seems suspicous. That shouldn't ever happen really
> and seems like a driver bug (like enabling vblanks too early, before the
> pipe is fully up&running).
> -Daniel
>
>>
>> I guess in theory the driver could fail during random times for whatever
>> reason, though I tend to think that the driver should either make it
>> robust or not even pretend to support it.
>>
>> I suppose one failure mode the driver can't really do anything about is
>> some random SMI crap or something stalling the timestamp queries enough
>> that we fail the precisions tests and exhaust the retry limits. So yeah,
>> making it robust against that kind of nastyness sounds line it might be
>> a good idea. Though perhaps it should be something a bit more severe
>> than a DRM_DEBUG since I think it really shouldn't happen when the
>> driver and system are otherwise sane. Of course if it routinely fails
>> with some driver then simply making it spew errors into dmesg isn't
>> so nice, unless the driver also gets fixed.
>>
I think i have an idea what might go wrong with nouveau, so i'll see if
i can add a fixup patch.
There's another scenario where this zero-ts case can be hit. If the
driver drm_vblank_init()'s - setting all timestamps to zero - and then
code starts using vblanks (drm_vblank_get()) without drm_vblank_on
beforehand, which is afaics the case with nouveau. Unless that's
considered an error as well, we'd need to init the timestamps to
something non-zero but harmless like 1 usecs at drm_vblank_init() time?
What makes sense as output here? DRM_WARN_ONCE?
-mario
>>>
>>> There are also new Soc drivers showing up which support those methods,
>>> but at least i didn't verify or test if their implementations are good
>>> enough for the needs of the new drm_udpate_vblank_count() which is more
>>> sensitive to kms drivers being even slightly off.
>>>
>>> -mario
>>>
>>>>>
>>>>> Fix this by checking if the old timestamp used for this calculations
>>>>> is zero == invalid. If so, perform a counter increment of +1 to
>>>>> prevent large counter jumps and reinitialize the timestamps to
>>>>> sane values.
>>>>>
>>>>> Signed-off-by: Mario Kleiner <mario.kleiner.de at gmail.com>
>>>>> Cc: <stable at vger.kernel.org> # 4.4+
>>>>> Cc: michel at daenzer.net
>>>>> Cc: vbabka at suse.cz
>>>>> Cc: ville.syrjala at linux.intel.com
>>>>> Cc: daniel.vetter at ffwll.ch
>>>>> Cc: dri-devel at lists.freedesktop.org
>>>>> Cc: alexander.deucher at amd.com
>>>>> Cc: christian.koenig at amd.com
>>>>> ---
>>>>> drivers/gpu/drm/drm_irq.c | 7 +++++++
>>>>> 1 file changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
>>>>> index fb17c45..88bdf19 100644
>>>>> --- a/drivers/gpu/drm/drm_irq.c
>>>>> +++ b/drivers/gpu/drm/drm_irq.c
>>>>> @@ -216,6 +216,13 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
>>>>> DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
>>>>> " diff_ns = %lld, framedur_ns = %d)\n",
>>>>> pipe, (long long) diff_ns, framedur_ns);
>>>>> +
>>>>> + /* No valid t_old to calculate diff? Bump +1 to force reinit. */
>>>>> + if (t_old->tv_sec == 0 && t_old->tv_usec == 0) {
>>>>> + DRM_DEBUG_VBL("crtc %u: No baseline ts. Bump +1.\n",
>>>>> + pipe);
>>>>> + diff = 1;
>>>>> + }
>>>>> } else {
>>>>> /* some kind of default for drivers w/o accurate vbl timestamping */
>>>>> diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;
>>>>> --
>>>>> 1.9.1
>>>>>
>>>>
>>
>> --
>> Ville Syrjälä
>> Intel OTC
>
More information about the dri-devel
mailing list