Radeon lockup on 3.8.5-201.fc18.x86_64
Andy Lutomirski
luto at amacapital.net
Tue Apr 23 12:31:02 PDT 2013
On Tue, Apr 23, 2013 at 10:15 AM, Michel Dänzer <michel at daenzer.net> wrote:
> On Tue, 2013-04-23 at 10:08 -0700, Andy Lutomirski wrote:
>> On Mon, Apr 22, 2013 at 10:55 PM, Michel Dänzer <michel at daenzer.net> wrote:
>> > On Mon, 2013-04-22 at 16:19 -0700, Andy Lutomirski wrote:
>> >
>> >> I'm not convinced there's an actual hang. 40 seconds is a long time,
>> >> and I've only ever seen this when clicking something, and when this
>> >> happens, the screen goes blank immediately (not after a 40 second
>> >> delay).
>> >
>> > Hmm, now that you mention this, I notice in your original report it
>> > claims that the CP stalled for 'more than 5102593msec', which is clearly
>> > bogus. Looks like something's wrong with the lockup detection.
>> > Did this start after a kernel update or something like that?
>>
>> It's recent. It may have been when F18 switched from 3.7 to 3.8.
>
> Can you reproduce it with an upstream kernel? Can you bisect? I realize
> it'll probably take a long time, but unless someone has an idea which
> change might have introduced the problem...
Yuck. I can try, but it takes days to reproduce this, so a bisect will
take forever (and may end up with a wrong answer if a bad kernel happens
not to lock up while I'm testing it).
>
>
>> I think there are bugs in the lockup detection and in the lockup
>> recovery. Firefox, in particular, is *really* slow afterwards. Are
>> interrupts possibly getting dropped or misconfigured during the reset?
>
> Let's not get ahead of ourselves and focus on the lockup detection issue
> for now.
I don't understand the r600_gpu_check_soft_reset code, but could this
be the sequence of events that triggers it?

1. radeon_ring_is_lockup is called just as the very last command on
   the ring completes, so last_rptr gets set to the rptr.
2. Nothing happens for a while (i.e. > lockup_timeout). rptr doesn't change.
3. A very slightly slow operation starts.
4. radeon_ring_is_lockup gets called before that command completes.

radeon_ring_test_lockup will not detect a jiffies wrap-around (because
there wasn't one), rptr will equal last_rptr (because there hasn't
been any progress since last time), and the elapsed time will be
really long, because the function hasn't been called for a long time.
So a lockup gets detected, even though nothing's wrong.
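
The sequence above can be replayed in a small user-space model. This is only a sketch of the check as described in this message, not the driver code: struct ring_model, ring_model_update, ring_model_test_lockup and replay_false_positive are hypothetical names, time is a plain 64-bit millisecond counter (so the jiffies wrap-around branch is left out), and the 10000 ms timeout is an arbitrary value picked for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-ring lockup tracking state. */
struct ring_model {
    uint32_t rptr;             /* current read pointer */
    uint32_t last_rptr;        /* read pointer saved at the last update */
    uint64_t last_activity_ms; /* time of the last update */
};

/* Record the current read pointer and the time it was sampled. */
static void ring_model_update(struct ring_model *r, uint64_t now_ms)
{
    r->last_rptr = r->rptr;
    r->last_activity_ms = now_ms;
}

/*
 * Check as described in the message: progress resets the tracking;
 * otherwise a lockup is reported once the time since the last update
 * reaches the timeout. Nothing here knows how long ago the check
 * itself last ran.
 */
static bool ring_model_test_lockup(struct ring_model *r, uint64_t now_ms,
                                   uint64_t timeout_ms)
{
    if (r->rptr != r->last_rptr) {
        ring_model_update(r, now_ms);
        return false;
    }
    return (now_ms - r->last_activity_ms) >= timeout_ms;
}

/* Replays steps 1 to 4; returns true if the false positive shows up. */
static bool replay_false_positive(void)
{
    const uint64_t timeout_ms = 10000; /* assumed value */
    struct ring_model r = { .rptr = 100, .last_rptr = 0,
                            .last_activity_ms = 0 };

    /* Step 1: the check runs just as the last command completes. */
    bool first = ring_model_test_lockup(&r, 1000, timeout_ms);

    /* Step 2: the ring sits idle for a minute; rptr stays at 100. */
    /* Step 3: a new command starts but has not advanced rptr yet. */

    /* Step 4: the check runs again, 5 ms into the new command. */
    bool second = ring_model_test_lockup(&r, 61005, timeout_ms);

    return !first && second;
}
```

In this model the second check sees rptr == last_rptr and an elapsed time of 60005 ms, almost all of which is idle time rather than time the ring spent stuck, so it reports a lockup.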
There's a comment above radeon_ring_test_lockup that says:
* A possible false positivie is if we get call after while and last_cp_rptr ==
* the current CP rptr, even if it's unlikely it might happen. To avoid this
* if the elapsed time since last call is bigger than 2 second than we return
* false and update the tracking information. Due to this the caller must call
* radeon_ring_test_lockup several time in less than 2sec for lockup to be reported
* the fencing code should be cautious about that.
but the corresponding code doesn't appear to exist anywhere.
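
For illustration, here is one way the guard described in that comment could look in a user-space model. This is a sketch under assumptions, not a proposed patch and not code from the driver: struct guarded_ring, guarded_test_lockup, last_check_ms and max_gap_ms are hypothetical names, time is a plain 64-bit millisecond counter, and treating "elapsed time since last call" as the gap between two consecutive checks is one reading of the comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracking state with an extra "last checked" timestamp. */
struct guarded_ring {
    uint32_t rptr;             /* current read pointer */
    uint32_t last_rptr;        /* read pointer saved at the last update */
    uint64_t last_activity_ms; /* when the tracking was last restarted */
    uint64_t last_check_ms;    /* when the check itself last ran */
};

/*
 * Lockup check with the guard from the comment: if the check has not
 * run for more than max_gap_ms, the saved state is considered stale,
 * so the tracking is restarted and no lockup is reported. A lockup is
 * reported only after repeated, closely spaced checks see no progress
 * for at least timeout_ms.
 */
static bool guarded_test_lockup(struct guarded_ring *r, uint64_t now_ms,
                                uint64_t timeout_ms, uint64_t max_gap_ms)
{
    uint64_t gap_ms = now_ms - r->last_check_ms;

    r->last_check_ms = now_ms;

    if (gap_ms > max_gap_ms || r->rptr != r->last_rptr) {
        /* Stale tracking data or real progress: start over. */
        r->last_rptr = r->rptr;
        r->last_activity_ms = now_ms;
        return false;
    }
    return (now_ms - r->last_activity_ms) >= timeout_ms;
}
```

With max_gap_ms set to 2000, a check that arrives a minute after the previous one only restarts the tracking. The caller then has to keep polling at intervals of at most 2 seconds, with rptr unchanged for the whole timeout, before a lockup is reported, which matches the requirement stated in the comment.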
Also, and unrelatedly, I retract my earlier comment about the gmail
issues being fixed with hyperz off. Gmail still draws incorrectly. This
may or may not have anything to do with the radeon driver.
--Andy