Funky new vblank counter regressions in Linux 4.4-rc1

Wed Nov 25 11:36:04 PST 2015

On Wed, Nov 25, 2015 at 08:04:26PM +0100, Mario Kleiner wrote:
> On 11/25/2015 06:46 PM, Ville Syrjälä wrote:
> > On Wed, Nov 25, 2015 at 06:24:13PM +0100, Mario Kleiner wrote:
> >> On 11/23/2015 09:24 PM, Ville Syrjälä wrote:
> >>> On Mon, Nov 23, 2015 at 06:58:34PM +0100, Mario Kleiner wrote:
> >>>>
> >>>>
> >>>> On 11/23/2015 04:51 PM, Ville Syrjälä wrote:
> >>>>> On Mon, Nov 23, 2015 at 04:23:21PM +0100, Mario Kleiner wrote:
> >>>>>> On 11/20/2015 04:34 PM, Ville Syrjälä wrote:
> >>>>>>> On Fri, Nov 20, 2015 at 04:24:50PM +0100, Mario Kleiner wrote:
> >>>>>>
> >>>>>> ...
> >>>>>> Ok, but why would that be a bad thing? I think we want it to think it is
> >>>>>> in the previous frame if it is called outside the vblank irq context.
> >>>>>> The only reason we fudge it to the next frames vblank if i vblank irq is
> >>>>>> because we know the vblank irq handler we are executing atm. was meant
> >>>>>> to execute within the upcoming vblank for the next frame, so we fudge
> >>>>>> the scanout positions and thereby timestamp to correspond to that new
> >>>>>> frame. But if something called outside irq context it should get a
> >>>>>> scanout position/timestamp that corresponds to "reality".
> >>>>>
> >>>>> It would be a bad thing since it would cause the timestamp to jump
> >>>>> backwards, and that would also cause the frame count guesstimate to go
> >>>>> backwards.
> >>>>>
> >>>>
> >>>> But only if we don't use the dev->driver->get_vblank_counter() method,
> >>>> which we try to use on AMD.
> >>>
> >>> Well, if you do it that way then you have the problem of the hw counter
> >>> seeming to jump forward by one after crossing the start of vblank (when
> >>> compared to the value you sampled when you processed the early vblank
> >>> interrupt).
> >>>
> >>
> >> Ok, finally i see the bad scenario that wouldn't get prevented by our
> >> current locking with the new vblank counting in the core. The vblank
> >> enable path is safe due to locking and discounting of redundant
> >> timestamps etc. But the disable path could go wrong:
> >>
> >> 1. Vblank irq fires, drm_handle_vblank() -> drm_update_vblank_count(),
> >> updates timestamps and counts "as if" in vblank -> incremented vblank
> >> count and timestamp now set in the future.
> >>
> >> 2. After vblank irq finishes, but just before leading edge of vblank,
> >> vblank_disable_and_save() executes, doesn't get bumped timestamp or
> >> count because before vblank and not in vblank irq. Now
> >> drm_update_vblank_count() would process a
> >> "new" timestamp and count from the past and we'd have time and counts
> >> going backwards, and bad things would happen.
> >>
> >> I haven't observed such a thing happening during testing so far,
> >> probably because the time window in which it could happen is tiny, but
> >> given how awfully bad it would be, it needs to be prevented.
> >>
> >> I had a look at the description of the Vblank irq in the "M76 Register
> >> Reference Guide" for older asics and the description suggests that the
> >> vblank irq fires when the crtc's line buffer is finished reading pixel
> >> data from the scanout buffer in memory for a frame, ie., when the line
> >> buffer read "enters" vblank. That would explain why the irq happens a
> >> few scanlines before actual vblank, because line buffer refills must
> >> obviously happen before the crtc can send pixel data from the line
> >> buffer to the encoders, so it would lead a bit in time. That also means
> >> we can't delay the vblank irq to actually happen at start of vblank and
> >> have to deal with the early vblank irq.
> >>
> >>> I guess one silly idea would be to defer the vblank interrupt processing
> >>> to a timer, and just schedule it a bit into the future from the actual
> >>> interrupt handler.
> >>>
> >>
> >> Timer would be bad because then we get problems with the pageflip
> >> completion irq sometimes being processed before the vblank irq,
> >
> > You you'd need to move page flip completion to happen from the vblank
> > timer too I suppose.
> >
> >> and
> >> because we want to be fast in vblank irq handling, delivering vblank
> >> events etc. I wouldn't trust a timer to be reliable enough for such
> >> short waits.
> >
> > hrtimers should be accurate. But maybe more expensive than the timer
> > wheel.
> >
> 
> Sounds all a bit complex and fraught with new possible complications. I 
> can't spend much more time on this than i did so far, and we need to get 
> this into 4.4-rc, so i am aiming for the most simple solution that would 
> work.
> 
> >> Busy waiting wouldn't be great either in irq.
> >>
> >> So what about this relatively simple one?
> >>
> >> 1. In radeon_get_crtc_scanoutpos() we artifically define the
> >> vblank_start line to be, e.g, 5 scanlines before the true start of
> >> vblank, so for the purpose of vblank counter queries and timestamping
> >> our "vblank" would start a bit earlier and the vblank irq would always
> >> execute in "vblank". Non-Irq invocations like vblank_disable_and_save()
> >> would also be treated to this early vblank start and so what the DRM
> >> core observes would always be consistent.
> >>
> >> 2. At least in radeon-kms we also use radeon_get_crtc_scanoutpos()
> >> internally for "dynpm" dynamic power management/reclocking, and to
> >> implement pageflip completion detection on asics older than DCE3 which
> >> don't have pageflip interrupts. For those cases we need to use the true
> >> start of vblank, so for this internal use we pass in some special flag
> >> into radeon_get_crtc_scanoutpos() to tell it to not shift the vblank
> >> start around.
> >>
> >> 3. I've added another check to my patch for drm_update_vblank_count() to
> >> catch timestamps going backwards and discounting such cases, for a bit
> >> of extra robustness against driver trouble.
> >>
> >> Any ideas why this would be a stupid idea? I'll try this now and just
> >> hope meanwhile nobody finds a reason why this would be bad.
> >
> > What I don't like is leaking any of this into the core code. All the
> > hacks should live in the hw specific code. We've managed to keep all
> > the i915 hacks hidden that way.
> >
> > So if you would:
> > - fudge your get_scanout_position as you suggested
> > - _not_ expose the hw counter to the vblank core since it
> >    disagrees with the scanout position
> > - keep the internal get_scanout_position variant/flag
> >    purely internal
> >
> > then I think I might like it. That way working hardware doesn't have to
> > be exposed to these hacks, and there's possible a bit less danger that it
> > all gets broken again next time the core needs some work.
> >
> 
> Ok. Exposing the fudged hw counter shouldn't be a problem though, given 
> that it would be fudged in a consistent way with the fudged scanout 
> positions and to have incremented at time of drm_handle_vblank(). We'd 
> bump the hw counter to the count of the new vblank at the same time when 
> the scanout positions would start counting backwards from minus 
> something to 0, showing how many vblank lines to go until start of 
> scanout, etc. The only difference to reality would be that this 
> simulated vblank would start 5 scanlines earlier than the true hw 
> vblank, but i can't see how the core would care about that.
> 
> 
> > One problem with that approach could be that the vblank event and page
> > flip event would be delievered at different times if you keep using the
> > page flip interrupt, but I'm not sure that would be a problem. At least
> > they should both have the same timestamp and counter value.
> >
> 
> That's the same now, and the timestamps and counts be the same. I'll 
> check with my measurement equipment that the timestamps will be still 
> accurate wrt. to reality.
> 
> Attached is my current patch i wanted to submit for the drm core's 
> drm_update_vblank_count(). I think it's good to make the core somewhat 
> robust against potential kms driver bugs or glitches. But if you 
> wouldn't like that patch, there wouldn't be much of a point sending it 
> out at all.
> 
> thanks,
> -mario
> 

> >From 2d5d58a1c575ad002ce2cb643f395d0e4757d959 Mon Sep 17 00:00:00 2001
> From: Mario Kleiner <mario.kleiner.de at gmail.com>
> Date: Wed, 25 Nov 2015 18:48:31 +0100
> Subject: [PATCH] drm/irq: Make drm_update_vblank_count() more robust.
> 
> The changes to drm_update_vblank_count() for Linux 4.4-rc
> made the function more fragile wrt. some hw quirks. E.g.,
> at dev->driver->enable_vblank(), AMD gpu's fire a spurious
> redundant vblank irq shortly after enabling vblank irqs, not
> locked to vblank. This causes a redundant call which needs
> to be suppressed to avoid miscounting.
> 
> To increase robustness, shuffle things around a bit:
> 
> On drivers with high precision vblank timestamping always
> evaluate the timestamp difference between current timestamp
> and previously recorded timestamp to detect such redundant
> invocations and no-op in that case.
> 
> Also detect and warn about timestamps going backwards to
> catch potential kms driver bugs.
> 
> This patch is meant for Linux 4.4-rc and later.
> 
> Signed-off-by: Mario Kleiner <mario.kleiner.de at gmail.com>
> ---
>  drivers/gpu/drm/drm_irq.c | 53 ++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 41 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
> index 819b8c1..8728c3c 100644
> --- a/drivers/gpu/drm/drm_irq.c
> +++ b/drivers/gpu/drm/drm_irq.c
> @@ -172,9 +172,11 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
>  				    unsigned long flags)
>  {
>  	struct drm_vblank_crtc *vblank = &dev->vblank[pipe];
> -	u32 cur_vblank, diff;
> +	u32 cur_vblank, diff = 0;
>  	bool rc;
>  	struct timeval t_vblank;
> +	const struct timeval *t_old;
> +	u64 diff_ns;
>  	int count = DRM_TIMESTAMP_MAXRETRIES;
>  	int framedur_ns = vblank->framedur_ns;
>  
> @@ -195,13 +197,15 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
>  		rc = drm_get_last_vbltimestamp(dev, pipe, &t_vblank, flags);
>  	} while (cur_vblank != dev->driver->get_vblank_counter(dev, pipe) && --count > 0);
>  
> -	if (dev->max_vblank_count != 0) {
> -		/* trust the hw counter when it's around */
> -		diff = (cur_vblank - vblank->last) & dev->max_vblank_count;
> -	} else if (rc && framedur_ns) {
> -		const struct timeval *t_old;
> -		u64 diff_ns;
> -
> +	/*
> +	 * Always use vblank timestamping based method if supported to reject
> +	 * redundant vblank irqs. E.g., AMD hardware needs this to not screw up
> +	 * due to some irq handling quirk.
> +	 *
> +	 * This also sets the diff value for use as fallback below in case the
> +	 * hw does not support a suitable hw vblank counter.
> +	 */
> +	if (rc && framedur_ns) {

If you fudged everything properly why do you still need this? With
working hw counter there should be no need to do this stuff.

>  		t_old = &vblanktimestamp(dev, pipe, vblank->count);
>  		diff_ns = timeval_to_ns(&t_vblank) - timeval_to_ns(t_old);
>  
> @@ -212,11 +216,36 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
>  		 */
>  		diff = DIV_ROUND_CLOSEST_ULL(diff_ns, framedur_ns);
>  
> -		if (diff == 0 && flags & DRM_CALLED_FROM_VBLIRQ)
> +		/* Catch driver timestamping bugs and prevent worse. */
> +		if ((s32) diff < 0) {
> +			DRM_DEBUG_VBL("crtc %u: Timestamp going backward!"
> +			" diff_ns = %lld, framedur_ns = %d)\n",
> +			pipe, (long long) diff_ns, framedur_ns);
> +			return;
> +		}
> +
> +		if (diff == 0 && flags & DRM_CALLED_FROM_VBLIRQ) {
>  			DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
> -				      " diff_ns = %lld, framedur_ns = %d)\n",
> -				      pipe, (long long) diff_ns, framedur_ns);
> -	} else {
> +			" diff_ns = %lld, framedur_ns = %d)\n",
> +			pipe, (long long) diff_ns, framedur_ns);
> +
> +			/* Treat this redundant invocation as no-op. */
> +			WARN_ON_ONCE(cur_vblank != vblank->last);
> +			return;
> +		}
> +	}
> +
> +	/*
> +	 * hw counters, if marked as supported via max_vblank_count != 0,
> +	 * *must* increment at leading edge of vblank or at least before
> +	 * the firing of the hw vblank irq, otherwise we have a race here
> +	 * between gpu and us, where small variations in our execution
> +	 * timing can cause off-by-one miscounting of vblanks.
> +	 */
> +	if (dev->max_vblank_count != 0) {
> +		/* trust the hw counter when it's around */
> +		diff = (cur_vblank - vblank->last) & dev->max_vblank_count;
> +	} else if (!(rc && framedur_ns)) {
>  		/* some kind of default for drivers w/o accurate vbl timestamping */
>  		diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;
>  	}
> -- 
> 1.9.1
> 

-- 
Ville Syrjälä
Intel OTC