[Intel-gfx] [PATCH] drm: Aggressively disable vblanks

Sun Jan 9 18:24:18 PST 2011

On Mon, Dec 27, 2010 at 7:58 AM, Mario Kleiner
<mario.kleiner at tuebingen.mpg.de> wrote:
> On Dec 26, 2010, at 3:53 PM, Andrew Lutomirski wrote:
>
>> On Wed, Dec 22, 2010 at 4:06 PM, Mario Kleiner
>> <mario.kleiner at tuebingen.mpg.de> wrote:
>>>
>>> There's a new drm module parameter for selecting the timeout: echo 50 >
>>> /sys/module/drm/parameters/vblankoffdelay
>>> would set the timeout to 50 msecs. A setting of zero will disable the
>>> timer,
>>> so vblank irq's would stay on all the time.
>>>
>>> The default setting is still 5000 msecs as before, but reducing this to
>>> 100
>>> msecs wouldn't be a real problem imho. At least i didn't observe any
>>> miscounting during extensive testing with 100 msecs.
>>>
>>> The patches in drm-next fix a couple of races that i observed on intel
>>> and
>>> radeon during testing and a few that i didn't see but that i could
>>> imagine
>>> happening. It tries to make sure that the saved final count at vblank irq
>>> disable of the software vblank_count and the gpu counter are consistent -
>>> no
>>> off by one errors. They also try to detect and filter out spurious vblank
>>> interrupts at vblank enable time, e.g., on the radeon.
>>>
>>> There's still one possible race in the disable path which i will try to
>>> fix:
>>> We don't know when exactly the hardware counter increments wrt. the
>>> processing of the vblank interrupt - it could increment a few
>>> (dozen/hundred) microseconds before or after the irq handler runs, so if
>>> you
>>> happen to query the hardware counter while the gpu is inside the vblank
>>> you
>>> can't be sure if you picked up the old count or the new count for that
>>> vblank.
>>
>> That's disgusting.  Does this affect many GPUs?  (I can't imagine why
>> any sensible implementation wouldn't guarantee that the counter
>> increments just before the IRQ.)
>
> ;-). I don't know, but at least on the tested R500 and R600 class Radeon's,
> this was the case, so i assume it's at least this way on many radeon gpu's
> (probably all avivo parts?) out there. We don't have any evergreen gpu's yet
> in our lab so i don't know how the more recent parts behave. Also it doesn't
> matter
>
> I guess it's also a matter of definition when a new video frame starts?
> Leading edge / trailing edge of vblank? Start of vsync? Something else?
>
>>>
>>> This only matters during vblank disable. For that reason it's not such a
>>> good idea to disable vblank irq's from within the vblank irq handler. I
>>> tried that and it didn't work well --> When doing it from within irq you
>>> basically maximize the chance of hitting the time window when the race
>>> can
>>> happen. Delaying within the irq handler by a millisecond would fix that,
>>> but
>>> that's not what we want.
>>>
>>> Having the disable in a function triggered by a timer like now is the
>>> most
>>> simple solution i could come up with. There we can burn a few dozen
>>> microseconds if neccessary to remove this race.
>>
>> Maybe I'm missing something, but other than the race above (which
>> seems like it shouldn't exist on sane hardware), the new code seems
>> more complicated than necessary.
>
> I don't think it's more complicated than necessary for what it tries to
> achieve, but of course i'm a bit biased. It also started off more simple and
> grew a bit when i found new issues with the tested gpu's.
>
> The aim is to fix a couple of real races and to make vblank counts and
> timestamps as trustworthy and oml_sync_control spec-conformant (see
> http://www.opengl.org/registry/specs/OML/glx_sync_control.txt) as possible.
> It only consumes a few additional bytes of memory (approx. 40 bytes) per
> crtc, doesn't use excessive time inside the irq handler and tries to avoid
> taking locks  that are shared between irq and non-irq context to avoid
> delays in irq execution, also if used with a kernel with the preempt_rt
> patch applied (important for my use case and other hard realtime apps). It's
> pretty self-contained and because most of it is driver-independent it can
> handle similar issues on different gpu's and kms drivers without the need
> for us to check each single gpu + driver combo if it has such issues or not.
>

OK, having read the glx_sync_control spec this makes a lot more sense.
 I reread the code and I have two questions right off the bat:

1. What's the point of vblank_disable_and_save?  AFAICT all that it
does (other than disabling vblanks) is to possibly increment
_vblank_count, ensuring that (barring races) it's correct as soon as
the call returns.  But there are only three ways that
vblank_disable_and_save gets called: the disable timer, and two paths
in i915 that turn off vblanks when disabling crtcs.  The latter two
shouldn't really care because the crtc is about to turn off, and the
former, being a timer, doesn't return into anything that cares about
the counter.

2. Why are the timestamps coming from do_gettimeofday instead of
ktime_get?  Currently, if someone changes the system time, then weird
things could happen, I think.

--Andy