Long radeon stalls on recent kernels

Wed Nov 26 07:38:03 PST 2014

On Tue, Nov 25, 2014 at 10:42 PM, Michel Dänzer <michel at daenzer.net> wrote:
> On 20.11.2014 09:58, Andy Lutomirski wrote:
>>
>> On Wed, Nov 19, 2014 at 4:07 PM, Andy Lutomirski <luto at amacapital.net>
>> wrote:
>>>
>>> On Tue, Nov 18, 2014 at 11:19 PM, Michel Dänzer <michel at daenzer.net>
>>> wrote:
>>>>
>>>> On 19.11.2014 09:21, Andy Lutomirski wrote:
>>>>>
>>>>>
>>>>> On Mon, Nov 17, 2014 at 1:51 AM, Michel Dänzer <michel at daenzer.net>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> On 15.11.2014 07:21, Andy Lutomirski wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On recent kernels (3.16 through 3.18-rc4, perhaps), doing anything
>>>>>>> graphics intensive seems to cause my system to become unusable for
>>>>>>> tens of seconds.  Pointing Firefox at Google Maps is a big offender
>>>>>>> --
>>>>>>> it can take several minutes for me to move my mouse far enough to
>>>>>>> close the tab and get my computer back.
>>>>>>>
>>>>>>> On bootup, I get this warning:
>>>>>>> [drm:btc_dpm_set_power_state] *ERROR*
>>>>>>> rv770_restrict_performance_levels_before_switch failed
>>>>>>>
>>>>>>> Setting radeon.dpm=0 seems to work around this problem at the cost of
>>>>>>> giving my rather slow graphics.
>>>>>>>
>>>>>>> Are there known issues here?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Can you bisect the kernel, or at least isolate which kernel version
>>>>>> first
>>>>>> introduced the problem?
>>>>>
>>>>>
>>>>>
>>>>> With whatever userspace I'm running, I'm seeing it 3.13, 3.14, 3.15,
>>>>> 3.16, and 3.18-rc4+.  I haven't tried other versions.
>>>>>
>>>>> With radeon.dpm=0, I can still trigger short stalls (around one
>>>>> second), but I seem unable to trigger long stalls easily.  (I say
>>>>> easily because, just as I was typing this email, my system stalled for
>>>>> about a minute.)
>>>>
>>>>
>>>>
>>>> I can only think of two things offhand that could cause such extremely
>>>> long
>>>> stalls: Swap thrashing or IRQ storms.
>>>>
>>>> With a setup where you can easily trigger long stalls, can you try
>>>> getting a
>>>> CPU profile for a stall with sysprof or perf?
>>>>
>>>>
>>>
>>> Got one with perf:
>>>
>>>    16.82%             Xorg  libc-2.18.so                        [.]
>>> __memcpy_sse2_unaligned
>>>     9.20%          swapper  [kernel.kallsyms]                   [k]
>>> intel_idle
>>>     1.00%             Xorg  [kernel.kallsyms]                   [k]
>>> evergreen_irq_set
>>>     0.83%          firefox  libxul.so                           [.]
>>> 0x0000000001d93281
>>>     0.69%          firefox  libxul.so                           [.]
>>> 0x0000000001d932ad
>>>     0.62%          firefox  [kernel.kallsyms]                   [k]
>>> copy_user_generic_string
>>>     0.55%          swapper  [kernel.kallsyms]                   [k]
>>> evergreen_irq_ack
>>>     0.54%          firefox  libpthread-2.18.so                  [.]
>>> pthread_mutex_lock
>>>     0.52%          firefox  libpthread-2.18.so                  [.]
>>> pthread_mutex_unlock
>>>     0.45%             Xorg  [kernel.kallsyms]                   [k]
>>> drm_mm_insert_node_in_range_generic
>>>     0.41%             Xorg  [kernel.kallsyms]                   [k]
>>> lock_release
>>>     0.40%             Xorg  [kernel.kallsyms]                   [k]
>>> lock_acquire
>>>     0.35%          firefox  firefox                             [.]
>>> 0x000000000001245d
>>>     0.33%             Xorg  [kernel.kallsyms]                   [k]
>>> __module_address
>>>     0.31%          firefox  [kernel.kallsyms]                   [k]
>>> clear_page_c
>>>     0.29%             Xorg  [kernel.kallsyms]                   [k]
>>> copy_user_generic_string
>>>     0.28%          firefox  firefox                             [.]
>>> 0x0000000000013159
>>>
>>> and:
>>>
>>> Samples: 11K of event 'irq:irq_handler_entry', Event count (approx.):
>>> 11802
>>>    87.43%          swapper  [kernel.kallsyms]  [k]
>>> handle_irq_event_percpu
>>>     7.52%          firefox  [kernel.kallsyms]  [k]
>>> handle_irq_event_percpu
>>>     1.84%      irq/36-ahci  [kernel.kallsyms]  [k]
>>> handle_irq_event_percpu
>>>     1.14%             Xorg  [kernel.kallsyms]  [k]
>>> handle_irq_event_percpu
>>>     0.75%      kworker/5:0  [kernel.kallsyms]  [k]
>>> handle_irq_event_percpu
>>>     0.32%      gnome-shell  [kernel.kallsyms]  [k]
>>> handle_irq_event_percpu
>>>     0.25%     kworker/5:1H  [kernel.kallsyms]  [k]
>>> handle_irq_event_percpu
>>>     0.25%  Media D~ode #10  [kernel.kallsyms]  [k]
>>> handle_irq_event_percpu
>>>     0.19%  ImageDe~er #330  [kernel.kallsyms]  [k]
>>> handle_irq_event_percpu
>>>     0.07%       pulseaudio  [kernel.kallsyms]  [k]
>>> handle_irq_event_percpu
>>>
>>> The cycles were with -e cycles:pp, so I think that iret would have
>>> shown up if there were enough IRQs to cause the problem.
>>>
>>> I'll build a kernel with latencytop.
>>>
>>
>> I just caught call_rwsem_down_write_failed for 5379 ms in khugepaged
>> (holy crap) and radeon_fence_default_wait for 489.2ms in Xorg.
>>
>> Turning off THP gets rid of the khugepaged thing.  The 489.2ms is
>> radeon_fence_default_wait is amazingly reproducible -- I've seen that
>> exact number three times now.
>
>
> Sounds like the long stalls were THP, but the shorter ones might be radeon?
>
> Can you get some call graphs for the profile or from latencytop? Make sure
> at least the kernel is built with frame pointers (CONFIG_FRAME_POINTER=y),
> preferably also userspace (-fno-omit-frame-pointer).

Will try next week.  I'm out of town.

--Andy

>
>
>
> --
> Earthling Michel Dänzer               |               http://www.amd.com
> Libre software enthusiast             |             Mesa and X developer

-- 
Andy Lutomirski
AMA Capital Management, LLC