[PATCH] drm/radeon: make 64bit fences more robust

Mon Sep 10 08:32:13 PDT 2012

On Mon, Sep 10, 2012 at 8:02 AM, Christian König
<deathsimple at vodafone.de> wrote:
> On 10.09.2012 13:12, Michel Dänzer wrote:
>>
>> On Mon, 2012-09-10 at 11:13 +0200, Christian König wrote:
>>>
>>> Only increase the higher 32bits if we really detect a wrap around.
>>>
>>> Fixes:
>>> https://bugs.freedesktop.org/show_bug.cgi?id=54129
>>> https://bugs.freedesktop.org/show_bug.cgi?id=54662
>>>
>>> Possible fixes:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=846505
>>> https://bugzilla.redhat.com/show_bug.cgi?id=845639
>>>
>>> Signed-off-by: Christian König <deathsimple at vodafone.de>
>>> Cc: stable at vger.kernel.org
>>> ---
>>>   drivers/gpu/drm/radeon/radeon_fence.c |    6 +++---
>>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/radeon/radeon_fence.c
>>> b/drivers/gpu/drm/radeon/radeon_fence.c
>>> index 7b737b9..4781e13 100644
>>> --- a/drivers/gpu/drm/radeon/radeon_fence.c
>>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c
>>> @@ -160,7 +160,7 @@ void radeon_fence_process(struct radeon_device *rdev,
>>> int ring)
>>>         do {
>>>                 seq = radeon_fence_read(rdev, ring);
>>>                 seq |= last_seq & 0xffffffff00000000LL;
>>> -               if (seq < last_seq) {
>>> +               if (seq < (last_seq - 0x80000000LL)) {
>>>                         seq += 0x100000000LL;
>>>                 }
>>
>> Can you provide a bit more explanation for this change? In particular,
>> how could the code previously detect a wraparound when there was none,
>> and why is this the proper fix?
>
>
> Honestly I also don't really understand how this bug happened in the first
> place.
>
> We extend the 32bit fences supported by hardware by testing if a previously
> read fence value is smaller than the value we read now:
>
>>                 if (seq < last_seq) {
>
>
> But the problem seems to be that on some systems we do get fence values that
> are decreasing, e.g. instead of 5, 6, 7, 8 we get 5, 7, 6, 8 (or maybe 5, 6,
> 0, 7, 8 because somebody accidentally overwrites the fence value).
>
> It might be related to a hardware bug, or the algorithm is flawed in a way I
> currently don't see. Anyway the old code we had wasn't so picky about such
> problems and the patch just tries to make the current code as robust as the
> old code was, which indeed seems to solve the problems we see.
>
> The wrap around detection still works (tested by setting the initial fence
> value to 0xfffffff0 and letting it wrap around shortly after start), so I
> think it we can safely commit this.
>
> Christian.
>

If fence read ever gave the value 0 then your patch is postponing the
issue until last fence reach >= 0x1 8000 0001 (which will takes
month/years of uptime to happen :)). Having a log of all emitted fence
value and all read en fence value would probably be helpful so we have
a better clue of what's going on.

Sadly i don't think we don't receive fence in order but rather that we
sometimes receive a 0 value fence, if we did receive fence out of
order than the issue would also have happen with previous code.

Otherwise patch is ack but please add a better comment about fence
value most likely being 0 for unknow reasons or receive out of order
(thought i don't think so).

Cheers,
Jerome