linux-4.4 bisected: kwin5 stuck on kde5 loading screen with radeon

Mario Kleiner mario.kleiner.de at gmail.com
Fri Jan 15 20:24:05 PST 2016



On 01/15/2016 01:26 PM, Ville Syrjälä wrote:
> On Fri, Jan 15, 2016 at 11:34:08AM +0100, Vlastimil Babka wrote:
>> Hi,
>>
>> since kernel 4.4 I'm unable to login to kde5 desktop (on openSUSE
>> Tumbleweed). There's a screen with progressbar showing the startup,
>> which normally fades away after reaching 100%. But with kernel 4.4, the
>> progress gets stuck somewhere between 1/2 and 3/4 (not always the same).
>> Top shows that kwin is using few % of CPU's but mostly sleeps in poll().
>> When I kill it from another console, I see that everything has actually
>> started up, just the progressbar screen was obscuring it. The windows
>> obviously don't have decorations etc. Starting kwin manually again shows
>> me again the progressbar screen at the same position.
>

Depressing. I was stress-testing those patches with Linux 4.4 for days 
on 2 AMD gpu's (HD-4000 RV 730 and HD-5770) under KDE 5 Plasma 5.4.2 
(KUbuntu 15.10, XOrg 1.17) and just retested Linux 4.4 on 
nouveau/radeon/intel also with XOrg 1.18 and XOrg master) with Linux 4.4 
a few days ago and never encountered such a hang or other vblank related 
problem on KDE-5 or GNOME-3.

I'm currently running...

while xinit /usr/bin/ksplashqml --test -- :1 ; do echo yay; done

... in an endless loop on Linux 4.4 SMP PREEMPT on HD-5770  and so far i 
can't trigger a hang after hundreds of runs.

Does this also hang for you?

> Hmm. Sounds like it could then be waiting for a vblank in the distant
> future. There's that 1<<23 limit in the code though, but even with that
> we end up with a max wait of ~38 hours assuming a 60Hz refresh rate.
>

xtrace suggests that ksplashqml seems to use classic OpenGL  + 
glXSwapBuffers under DRI2. So no clever swap scheduling based on vblank 
counter values.

> Stuff to try might include enabling drm.debug=0x2f, though that'll
> generate a lot of stuff. Another option would be to use the drm vblank
> tracepoints to try and catch what seq number it's waiting for and
> where we're at currently. Or I suppose you could just hack
> up drm_wait_vblank() to print an error message or something if the
> requested seq number is in the future by, say, more than a few seconds,
> and if that's the case then we could try to figure out why that happens.
>
>>
>> I have suspected that kwin is waiting for some event, but nevertheless
>> tried bisecting the kernel between 4.3 and 4.4, which lead to:
>>
>> # first bad commit: [4dfd64862ff852df7b1198d667dda778715ee88f] drm: Use
>> vblank timestamps to guesstimate how many vblanks were missed
>>
>> I can confirm that 4.4 works if I revert the following commits:
>> 63154ff230fc9255cc507af6277cd181943c50a1 "drm/amdgpu: Fixup hw vblank
>> counter/ts for new drm_update_vblank_count() (v3)"
>>
>> d1145ad1e41b6c33758a856163198cb53bb96a50 "drm/radeon: Fixup hw vblank
>> counter/ts for new drm_update_vblank_count() (v2)"
>
> The sha1s don't seem to match what I have, so not sure which kernel tree
> you have, but looking at the radeon commit at least one thing
> immediately caught my attention;
>
> +                       /* Bump counter if we are at >= leading edge of vblank,
> +                        * but before vsync where vpos would turn negative and
> +                        * the hw counter really increments.
> +                        */
> +                       if (vpos >= 0)
> +                               count++;
>
> It's rather hard to see what it's really doing since the custom flags to
> the get_scanout_position now cause it return non-standard things. But if
> I'm reading things correctly it should really say something like:
>
> if (vpos >= 0 && vpos < (vsync_start - vblank_start))
> 	count++;
>
> Hmm. Actually even that might not be correct since it could be using the
> "fake" vblank start here, so might be it'd need to be something like:
>
> if (vpos >= 0 && vpos < (vsync_start - vblank_start + lb_vblank_lead_lines)
> 	count++;
>

The current code should be correct. vpos here returns the distance of hw 
vertical scanout position to the start of vblank. According to Alex and 
Harry Wentland of AMD's display team, and my testing of my two cards the 
hw vertical scanout position resets to zero at start line of vsync, 
therefore the "vpos" in that code becomes negative at start of vsync. At 
the same time the hw frame counter increments by one, making that 
"count++" to bump the returned count by +1 no longer neccessary.

If the reset of hw vertical scanout pos to zero and the increment of hw 
frame counter wouldn't happen at exactly the same time at start of vsync 
i could see how that could cause two successive queries of 
driver->get_vblank_counter() could report a count of N+1 and then N if 
the timing of both calls would be just perfectly right. That would cause 
the DRM code to falsely detect counter wraparound and jump the vblank 
counter forward by 2^24.

My tested gpu's had DCE-3 or DCE-4 display engines, Caicos has DCE-5, so 
could this be some hw quirk for DCE-5?


> Also might be worth a shot to just ignore the hw frame counter. Eg.:
>
> index e266ffc520d2..db728580549a 100644
> --- a/drivers/gpu/drm/radeon/radeon_drv.c
> +++ b/drivers/gpu/drm/radeon/radeon_drv.c
> @@ -492,7 +492,6 @@ static struct drm_driver kms_driver = {
>          .lastclose = radeon_driver_lastclose_kms,
>          .set_busid = drm_pci_set_busid,
>          .unload = radeon_driver_unload_kms,
> -       .get_vblank_counter = radeon_get_vblank_counter_kms,
>          .enable_vblank = radeon_enable_vblank_kms,
>          .disable_vblank = radeon_disable_vblank_kms,
>          .get_vblank_timestamp = radeon_get_vblank_timestamp_kms,
> diff --git a/drivers/gpu/drm/radeon/radeon_irq_kms.c b/drivers/gpu/drm/radeon/radeon_irq_kms.c
> index 979f3bf65f2c..3c5fcab74152 100644
> --- a/drivers/gpu/drm/radeon/radeon_irq_kms.c
> +++ b/drivers/gpu/drm/radeon/radeon_irq_kms.c
> @@ -152,11 +152,6 @@ int radeon_driver_irq_postinstall_kms(struct drm_device *dev)
>   {
>          struct radeon_device *rdev = dev->dev_private;
>
> -       if (ASIC_IS_AVIVO(rdev))
> -               dev->max_vblank_count = 0x00ffffff;
> -       else
> -               dev->max_vblank_count = 0x001fffff;
> -
>          return 0;
>   }
>
> assuming I'm reading the code correctly.
>
>>
>> 31ace027c9f1f8e0a2b09bbf961e4db7b1f6cf19 "drm: Don't zero vblank
>> timestamps from the irq handler"
>>
>> ac0567a4b132fa66e3edf3f913938af9daf7f916 "drm: Add DRM_DEBUG_VBL()"
>>
>> 4dfd64862ff852df7b1198d667dda778715ee88f "drm: Use vblank timestamps to
>> guesstimate how many vblanks were missed"
>>
>> All clean reverts, just needs some fixup on top to use abs() instead of
>> abs64() due to 79211c8ed19c055ca105502c8733800d442a0ae6.
>>
>> Unfortunately I don't know if this is a kernel problem or kwin problem.
>> I tried to CC maintainers of both, advices what to try or what info to
>> provide welcome. The card is "CAICOS" with 1GB memory.
>>

I think a drm.debug=0x21 setting and grep'ping the syslog for "vblank" 
should probably give useful info around the time of the hang.

Maybe also check XOrg.0.log for (WW) warnings related to flip.

thanks,
-mario


>> Thanks,
>> Vlastimil
>


More information about the dri-devel mailing list