Funky new vblank counter regressions in Linux 4.4-rc1

Wed Nov 25 11:38:04 PST 2015

On Wed, Nov 25, 2015 at 1:21 PM, Mario Kleiner
<mario.kleiner.de at gmail.com> wrote:
> On 11/25/2015 06:58 PM, Ville Syrjälä wrote:
>>
>> On Wed, Nov 25, 2015 at 06:24:13PM +0100, Mario Kleiner wrote:
>>>
>>> On 11/23/2015 09:24 PM, Ville Syrjälä wrote:
>>>>
>>>> On Mon, Nov 23, 2015 at 06:58:34PM +0100, Mario Kleiner wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 11/23/2015 04:51 PM, Ville Syrjälä wrote:
>>>>>>
>>>>>> On Mon, Nov 23, 2015 at 04:23:21PM +0100, Mario Kleiner wrote:
>>>>>>>
>>>>>>> On 11/20/2015 04:34 PM, Ville Syrjälä wrote:
>>>>>>>>
>>>>>>>> On Fri, Nov 20, 2015 at 04:24:50PM +0100, Mario Kleiner wrote:
>>>>>>>
>>>>>>>
>>>>>>> ...
>>>>>>> Ok, but why would that be a bad thing? I think we want it to think it
>>>>>>> is in the previous frame if it is called outside the vblank irq
>>>>>>> context. The only reason we fudge it to the next frame's vblank if in
>>>>>>> vblank irq is because we know the vblank irq handler we are executing
>>>>>>> atm. was meant to execute within the upcoming vblank for the next
>>>>>>> frame, so we fudge the scanout positions and thereby timestamp to
>>>>>>> correspond to that new frame. But if something is called outside irq
>>>>>>> context it should get a scanout position/timestamp that corresponds
>>>>>>> to "reality".
>>>>>>
>>>>>>
>>>>>> It would be a bad thing since it would cause the timestamp to jump
>>>>>> backwards, and that would also cause the frame count guesstimate to go
>>>>>> backwards.
>>>>>>
>>>>>
>>>>> But only if we don't use the dev->driver->get_vblank_counter() method,
>>>>> which we try to use on AMD.
>>>>
>>>>
>>>> Well, if you do it that way then you have the problem of the hw counter
>>>> seeming to jump forward by one after crossing the start of vblank (when
>>>> compared to the value you sampled when you processed the early vblank
>>>> interrupt).
>>>>
>>>
>>> Ok, finally i see the bad scenario that wouldn't get prevented by our
>>> current locking with the new vblank counting in the core. The vblank
>>> enable path is safe due to locking and discounting of redundant
>>> timestamps etc. But the disable path could go wrong:
>>>
>>> 1. Vblank irq fires, drm_handle_vblank() -> drm_update_vblank_count(),
>>> updates timestamps and counts "as if" in vblank -> incremented vblank
>>> count and timestamp now set in the future.
>>>
>>> 2. After vblank irq finishes, but just before leading edge of vblank,
>>> vblank_disable_and_save() executes, doesn't get bumped timestamp or
>>> count because before vblank and not in vblank irq. Now
>>> drm_update_vblank_count() would process a "new" timestamp and count
>>> from the past and we'd have time and counts going backwards, and bad
>>> things would happen.
>>>
>>> I haven't observed such a thing happening during testing so far,
>>> probably because the time window in which it could happen is tiny, but
>>> given how awfully bad it would be, it needs to be prevented.
>>>
>>> I had a look at the description of the Vblank irq in the "M76 Register
>>> Reference Guide" for older asics and the description suggests that the
>>> vblank irq fires when the crtc's line buffer is finished reading pixel
>>> data from the scanout buffer in memory for a frame, ie., when the line
>>> buffer read "enters" vblank.
>>
>>
>> Hmm. Does that mean there's always at least one fullscreen plane enabled
>> in the hw? As in you can't turn off the primary plane or make it smaller
>> than the active video area? Otherwise it sounds like you could either
>> not get it at all, or get it somewhere in the middle of the screen.
>>
>
> It says "Interrupt that can be programmed to be generated by the
> primary display controller's line buffer logic either when the
> source image line counter is not requesting any active
> display data (i.e. in the vertical blank) or the output CRTC
> timing generator is within the vertical blanking region."
>
> So my statements were my interpretation of this quote, so i can make some
> sense out of the vblank irq behaviour. I guess Alex or Harry would know? The
> M76 reference refers to some older asics, i just assume it is the same for
> the current ones, given that observed behaviour would be consistent with the
> line buffer causing this lead of a couple of scanlines. I see about 2
> scanlines on DCE4 and about 3 scanlines on DCE3. I don't know how big the
> line buffer is, how quickly it refills etc., but it sounds reasonable.

The size of the line buffer varies by generation, but the LB logic is
still responsible for generating the vblank interrupt even on newer
hw.

Alex
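
For readers following the thread, the disable-path scenario quoted above
(steps 1 and 2) can be sketched as a toy model. This is not the DRM core
code: the struct, the toy_* function names, the 60 Hz frame duration and
the rounding rule are all assumptions made for illustration. It only shows
how a sample attributed to the upcoming vblank, followed by a sample
attributed to the previous one, yields a negative frame delta.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model only; every name here is hypothetical, not a kernel API. */
#define FRAME_NS 16666667LL          /* assumed 60 Hz refresh period */

struct toy_vblank {
    uint32_t count;                  /* last stored vblank count */
    int64_t  ts_ns;                  /* last stored vblank timestamp */
};

/*
 * Pick the vblank a sample belongs to. Inside the (early) vblank irq the
 * sample is fudged forward to the upcoming vblank; outside the irq, a
 * sample taken before the leading edge maps to the previous vblank.
 */
static int64_t toy_vblank_timestamp(int64_t now_ns, int64_t next_vblank_ns,
                                    int in_irq)
{
    if (in_irq || now_ns >= next_vblank_ns)
        return next_vblank_ns;
    return next_vblank_ns - FRAME_NS;
}

/*
 * Derive a frame delta from the timestamp difference (rounded to the
 * nearest frame), apply it to the stored count, and return the delta.
 */
static int toy_update(struct toy_vblank *v, int64_t ts_ns)
{
    assert(v != NULL);
    int64_t delta = ts_ns - v->ts_ns;
    int64_t half = FRAME_NS / 2;
    int diff = (int)((delta + (delta >= 0 ? half : -half)) / FRAME_NS);

    v->count += (uint32_t)diff;
    v->ts_ns = ts_ns;
    return diff;
}

/* Replays steps 1 and 2; returns the frame delta seen by the disable path. */
static int toy_disable_race(void)
{
    struct toy_vblank v = { .count = 100, .ts_ns = 0 };
    int64_t next_vblank = FRAME_NS;

    /* Step 1: irq fires ~30 us before the leading edge of vblank. */
    int d1 = toy_update(&v, toy_vblank_timestamp(next_vblank - 30000,
                                                 next_vblank, 1));
    printf("irq path:     diff=%d count=%u\n", d1, (unsigned)v.count);

    /* Step 2: disable path samples ~10 us before vblank, outside irq. */
    int d2 = toy_update(&v, toy_vblank_timestamp(next_vblank - 10000,
                                                 next_vblank, 0));
    printf("disable path: diff=%d count=%u\n", d2, (unsigned)v.count);
    return d2;
}
```

Calling toy_disable_race() from a main() reports a delta of +1 for the irq
path and -1 for the disable path under these assumptions, i.e. the stored
count steps from 101 back to 100.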