[PATCH 2/3] drm/radeon: handle lockup in delayed work, v3

Maarten Lankhorst maarten.lankhorst at canonical.com
Mon Aug 18 08:28:36 PDT 2014


Op 18-08-14 om 17:12 schreef Christian König:
> Am 18.08.2014 um 16:45 schrieb Maarten Lankhorst:
>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst at canonical.com>
>> ---
>> V1 had a nasty bug breaking gpu lockup recovery. The fix is not
>> allowing radeon_fence_driver_check_lockup to take exclusive_lock,
>> and kill it during lockup recovery instead.
>> V2 used delayed work that ran during lockup recovery, but required
>> read lock. I've fixed this by downgrading the write, and retrying if
>> recovery fails.
>> ---
>>   drivers/gpu/drm/radeon/radeon.h       |   2 +
>>   drivers/gpu/drm/radeon/radeon_fence.c | 115 +++++++++++++++++-----------------
>>   2 files changed, 61 insertions(+), 56 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
>> index 1d806983ec7b..29504efe8971 100644
>> --- a/drivers/gpu/drm/radeon/radeon.h
>> +++ b/drivers/gpu/drm/radeon/radeon.h
>> @@ -355,6 +355,8 @@ struct radeon_fence_driver {
>>       uint64_t            sync_seq[RADEON_NUM_RINGS];
>>       atomic64_t            last_seq;
>>       bool                initialized;
>> +    struct delayed_work        fence_check_work;
>> +    struct radeon_device        *rdev;
>
> Put the reference to the device as the first field in the structure.
>
>>   };
>>     struct radeon_fence {
>> diff --git a/drivers/gpu/drm/radeon/radeon_fence.c b/drivers/gpu/drm/radeon/radeon_fence.c
>> index 913787085dfa..94eca53d99f8 100644
>> --- a/drivers/gpu/drm/radeon/radeon_fence.c
>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c
>> @@ -125,16 +125,7 @@ int radeon_fence_emit(struct radeon_device *rdev,
>>       return 0;
>>   }
>>   -/**
>> - * radeon_fence_process - process a fence
>> - *
>> - * @rdev: radeon_device pointer
>> - * @ring: ring index the fence is associated with
>> - *
>> - * Checks the current fence value and wakes the fence queue
>> - * if the sequence number has increased (all asics).
>> - */
>> -void radeon_fence_process(struct radeon_device *rdev, int ring)
>> +static bool __radeon_fence_process(struct radeon_device *rdev, int ring)
>
> Don't use "__" for internal radeon function names and especially don't remove the function documentation.
I've moved the documentation to the new radeon_fence_process function that calls the __radeon_fence_process one,
which is an internal function that is only used by the driver. I guess I can rename it to __radeon_fence_process_nowake or something instead,
but documentation doesn't make sense since it's not used outside of radeon_fence.c
>>   {
>>       uint64_t seq, last_seq, last_emitted;
>>       unsigned count_loop = 0;
>> @@ -190,7 +181,53 @@ void radeon_fence_process(struct radeon_device *rdev, int ring)
>>           }
>>       } while (atomic64_xchg(&rdev->fence_drv[ring].last_seq, seq) > seq);
>>   -    if (wake)
>> +    if (seq < last_emitted && !rdev->in_reset)
>> +        mod_delayed_work(system_power_efficient_wq,
>> +                 &rdev->fence_drv[ring].fence_check_work,
>> +                 RADEON_FENCE_JIFFIES_TIMEOUT);
>
> Am I wrong or do you queue the work only after radeon_fence_process is called for the first time?
>
> Might be a good idea to have an explicit queue_delayed_work in radeon_fence_emit as well.
Yeah might as well, with these changes it makes sense to run it as soon as possible.

But if there are no waiters, there's no real need. It's a tree falling in a forest with no-one around to hear it. ;-)



More information about the dri-devel mailing list