[PATCH] drm/radeon: Avoid double gpu reset by adding a timeout on IB ring tests.
Christian König
deathsimple at vodafone.de
Wed Feb 3 12:24:49 UTC 2016
Am 31.01.2016 um 19:50 schrieb Matthew Dawson:
> On Sunday, January 24, 2016 10:49:00 AM EST Christian König wrote:
>> Am 24.01.2016 um 07:18 schrieb Matthew Dawson:
>>> When the radeon driver resets a gpu, it attempts to test whether all the
>>> rings can successfully handle an IB. If these rings fail to respond, the
>>> process will wait forever. Another gpu reset can't happen at this point,
>>> as the current reset holds a lock required to do so. Instead, make all
>>> the IB tests run with a timeout, so the system can attempt to recover
>>> in this case.
>>>
>>> While this doesn't fix the underlying issue with card resets failing, it
>>> gives the system a higher chance of recovering. These timeouts have been
>>> confirmed to help both a Tathi and Hawaii card recover after a gpu reset.
>>>
>>> I haven't been able to test this on other cards, but everything compiles.
>>> I wasn't sure what to use for a timeout value, so for now I've used
>>> radeon_lockup_timeout. When that is 0, the timeout functionality in this
>>> patch is also disabled.
>>>
>>> This also adds a new function, radeon_fence_wait_timeout, that behaves
>>> like
>>> radeon_fence_wait except allowing a different timeout value to be passed
>>> to
>>> radeon_fence_wait_seq_timeout.
>>>
>>> Signed-off-by: Matthew Dawson <matthew at mjdsystems.ca>
>> Really good idea, but please don't use radeon_lockup_timeout for this
>> cause the 10 seconds default of that is way to long. Instead just use a
>> fixed, let's say 100ms timeout defined in radeon.h.
> I originally tried 100ms, but I found that to be too short (my system would
> just lockup completely (no network even)). I found 1s is long enough, and
> isn't so long to be annoying. Would that be ok?
Yeah, 1s works as well. But that 100ms shouldn't be enough sounds like
another bug to me.
>
>> Also don't define our own radeon_fence_wait_timeout, just use
>> fence_wait_timeout(&fence->base, RADEON_IB_TEST_TIMEOUT) for this.
> I switched everything over to this, however the ib test always times out after
> a gpu reset (on startup, everything is normal). I'm not sure why
> fence_wait_timeout isn't being signaled while radeon_fence_wait_seq_timeout
> is. I'm suspicious that either radeon_fence_enable_signaling is not doing
> something necessary to get the fence completion to actual signal
> radeon_fence_default_wait, or because we hold the exclusive lock during a
> reset radeon_fence_check_lockup never runs and thus never triggers the fence
> or enables some irq lines.
>
> Is it worth digging through the interactions here to get dma-buf fences to
> work correctly during a lockup, or would it be better to just keep
> radeon_fence_wait_timeout? If option 1 is preferred, I'm happy to learn but I
> need some help learning how it is supposed to work.
In this case feel free to add the radeon_fence_wait_timeout. It's
probably a good idea to get rid of the exclusive lock sooner or later in
radeon as well, but that really is a long term project.
If you're really bored I can give you tons of input where to start with
that.
Regards,
Christian.
>
>> Apart from that the idea looks really good to me.
>>
>> Regards,
>> Christian.
>>
> Thanks,
>
> Matthew
>
>>> ---
>>>
>>> drivers/gpu/drm/radeon/cik_sdma.c | 3 ++-
>>> drivers/gpu/drm/radeon/r100.c | 3 ++-
>>> drivers/gpu/drm/radeon/r600.c | 3 ++-
>>> drivers/gpu/drm/radeon/r600_dma.c | 3 ++-
>>> drivers/gpu/drm/radeon/radeon.h | 1 +
>>> drivers/gpu/drm/radeon/radeon_fence.c | 25 ++++++++++++++++++++++---
>>> drivers/gpu/drm/radeon/radeon_vce.c | 3 ++-
>>> drivers/gpu/drm/radeon/uvd_v1_0.c | 3 ++-
>>> 8 files changed, 35 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/radeon/cik_sdma.c
>>> b/drivers/gpu/drm/radeon/cik_sdma.c index d16f2ee..014f5ca 100644
>>> --- a/drivers/gpu/drm/radeon/cik_sdma.c
>>> +++ b/drivers/gpu/drm/radeon/cik_sdma.c
>>> @@ -737,7 +737,8 @@ int cik_sdma_ib_test(struct radeon_device *rdev,
>>> struct radeon_ring *ring)>
>>> DRM_ERROR("radeon: failed to schedule ib (%d).\n", r);
>>> return r;
>>>
>>> }
>>>
>>> - r = radeon_fence_wait(ib.fence, false);
>>> + r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies(
>>> + (radeon_lockup_timeout) ? radeon_lockup_timeout :
>>> MAX_SCHEDULE_TIMEOUT));>
>>> if (r) {
>>>
>>> DRM_ERROR("radeon: fence wait failed (%d).\n", r);
>>> return r;
>>>
>>> diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
>>> index 5eae0a8..5386c09 100644
>>> --- a/drivers/gpu/drm/radeon/r100.c
>>> +++ b/drivers/gpu/drm/radeon/r100.c
>>> @@ -3732,7 +3732,8 @@ int r100_ib_test(struct radeon_device *rdev, struct
>>> radeon_ring *ring)>
>>> DRM_ERROR("radeon: failed to schedule ib (%d).\n", r);
>>> goto free_ib;
>>>
>>> }
>>>
>>> - r = radeon_fence_wait(ib.fence, false);
>>> + r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies(
>>> + (radeon_lockup_timeout) ? radeon_lockup_timeout :
>>> MAX_SCHEDULE_TIMEOUT));>
>>> if (r) {
>>>
>>> DRM_ERROR("radeon: fence wait failed (%d).\n", r);
>>> goto free_ib;
>>>
>>> diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
>>> index cc2fdf0..8fbe5d7 100644
>>> --- a/drivers/gpu/drm/radeon/r600.c
>>> +++ b/drivers/gpu/drm/radeon/r600.c
>>> @@ -3381,7 +3381,8 @@ int r600_ib_test(struct radeon_device *rdev, struct
>>> radeon_ring *ring)>
>>> DRM_ERROR("radeon: failed to schedule ib (%d).\n", r);
>>> goto free_ib;
>>>
>>> }
>>>
>>> - r = radeon_fence_wait(ib.fence, false);
>>> + r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies(
>>> + (radeon_lockup_timeout) ? radeon_lockup_timeout :
>>> MAX_SCHEDULE_TIMEOUT));>
>>> if (r) {
>>>
>>> DRM_ERROR("radeon: fence wait failed (%d).\n", r);
>>> goto free_ib;
>>>
>>> diff --git a/drivers/gpu/drm/radeon/r600_dma.c
>>> b/drivers/gpu/drm/radeon/r600_dma.c index d2dd29a..3ff614d 100644
>>> --- a/drivers/gpu/drm/radeon/r600_dma.c
>>> +++ b/drivers/gpu/drm/radeon/r600_dma.c
>>> @@ -368,7 +368,8 @@ int r600_dma_ib_test(struct radeon_device *rdev,
>>> struct radeon_ring *ring)>
>>> DRM_ERROR("radeon: failed to schedule ib (%d).\n", r);
>>> return r;
>>>
>>> }
>>>
>>> - r = radeon_fence_wait(ib.fence, false);
>>> + r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies(
>>> + (radeon_lockup_timeout) ? radeon_lockup_timeout :
>>> MAX_SCHEDULE_TIMEOUT));>
>>> if (r) {
>>>
>>> DRM_ERROR("radeon: fence wait failed (%d).\n", r);
>>> return r;
>>>
>>> diff --git a/drivers/gpu/drm/radeon/radeon.h
>>> b/drivers/gpu/drm/radeon/radeon.h index 5ae6db9..0b698b6 100644
>>> --- a/drivers/gpu/drm/radeon/radeon.h
>>> +++ b/drivers/gpu/drm/radeon/radeon.h
>>> @@ -382,6 +382,7 @@ void radeon_fence_driver_force_completion(struct
>>> radeon_device *rdev, int ring);>
>>> int radeon_fence_emit(struct radeon_device *rdev, struct radeon_fence
>>> **fence, int ring); void radeon_fence_process(struct radeon_device
>>> *rdev, int ring);
>>> bool radeon_fence_signaled(struct radeon_fence *fence);
>>>
>>> +int radeon_fence_wait_timeout(struct radeon_fence *fence, bool
>>> interruptible, long timeout);>
>>> int radeon_fence_wait(struct radeon_fence *fence, bool interruptible);
>>> int radeon_fence_wait_next(struct radeon_device *rdev, int ring);
>>> int radeon_fence_wait_empty(struct radeon_device *rdev, int ring);
>>>
>>> diff --git a/drivers/gpu/drm/radeon/radeon_fence.c
>>> b/drivers/gpu/drm/radeon/radeon_fence.c index 05815c4..9fec805 100644
>>> --- a/drivers/gpu/drm/radeon/radeon_fence.c
>>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c
>>> @@ -527,7 +527,7 @@ static long radeon_fence_wait_seq_timeout(struct
>>> radeon_device *rdev,>
>>> }
>>>
>>> /**
>>>
>>> - * radeon_fence_wait - wait for a fence to signal
>>> + * radeon_fence_wait_timeout - wait for a fence to signal with timeout
>>>
>>> *
>>> * @fence: radeon fence object
>>> * @intr: use interruptible sleep
>>>
>>> @@ -535,9 +535,10 @@ static long radeon_fence_wait_seq_timeout(struct
>>> radeon_device *rdev,>
>>> * Wait for the requested fence to signal (all asics).
>>> * @intr selects whether to use interruptable (true) or
>>> non-interruptable
>>> * (false) sleep when waiting for the fence.
>>>
>>> + * @timeout: maximum time to wait, or MAX_SCHEDULE_TIMEOUT for infinite
>>> wait>
>>> * Returns 0 if the fence has passed, error for all other cases.
>>> */
>>>
>>> -int radeon_fence_wait(struct radeon_fence *fence, bool intr)
>>> +int radeon_fence_wait_timeout(struct radeon_fence *fence, bool intr, long
>>> timeout)>
>>> {
>>>
>>> uint64_t seq[RADEON_NUM_RINGS] = {};
>>> long r;
>>>
>>> @@ -552,9 +553,11 @@ int radeon_fence_wait(struct radeon_fence *fence,
>>> bool intr)>
>>> return fence_wait(&fence->base, intr);
>>>
>>> seq[fence->ring] = fence->seq;
>>>
>>> - r = radeon_fence_wait_seq_timeout(fence->rdev, seq, intr,
>>> MAX_SCHEDULE_TIMEOUT); + r = radeon_fence_wait_seq_timeout(fence->rdev,
>>> seq, intr, timeout);>
>>> if (r < 0) {
>>>
>>> return r;
>>>
>>> + } else if (r == 0) {
>>> + return -ETIMEDOUT;
>>>
>>> }
>>>
>>> r = fence_signal(&fence->base);
>>>
>>> @@ -564,6 +567,22 @@ int radeon_fence_wait(struct radeon_fence *fence,
>>> bool intr)>
>>> }
>>>
>>> /**
>>>
>>> + * radeon_fence_wait - wait for a fence to signal
>>> + *
>>> + * @fence: radeon fence object
>>> + * @intr: use interruptible sleep
>>> + *
>>> + * Wait for the requested fence to signal (all asics).
>>> + * @intr selects whether to use interruptable (true) or non-interruptable
>>> + * (false) sleep when waiting for the fence.
>>> + * Returns 0 if the fence has passed, error for all other cases.
>>> + */
>>> +int radeon_fence_wait(struct radeon_fence *fence, bool intr)
>>> +{
>>> + return radeon_fence_wait_timeout(fence, intr, MAX_SCHEDULE_TIMEOUT);
>>> +}
>>> +
>>> +/**
>>>
>>> * radeon_fence_wait_any - wait for a fence to signal on any ring
>>> *
>>> * @rdev: radeon device pointer
>>>
>>> diff --git a/drivers/gpu/drm/radeon/radeon_vce.c
>>> b/drivers/gpu/drm/radeon/radeon_vce.c index 7eb1ae7..7d80dad 100644
>>> --- a/drivers/gpu/drm/radeon/radeon_vce.c
>>> +++ b/drivers/gpu/drm/radeon/radeon_vce.c
>>> @@ -810,7 +810,8 @@ int radeon_vce_ib_test(struct radeon_device *rdev,
>>> struct radeon_ring *ring)>
>>> goto error;
>>>
>>> }
>>>
>>> - r = radeon_fence_wait(fence, false);
>>> + r = radeon_fence_wait_timeout(fence, false, msecs_to_jiffies(
>>> + (radeon_lockup_timeout) ? radeon_lockup_timeout :
>>> MAX_SCHEDULE_TIMEOUT));>
>>> if (r) {
>>>
>>> DRM_ERROR("radeon: fence wait failed (%d).\n", r);
>>>
>>> } else {
>>>
>>> diff --git a/drivers/gpu/drm/radeon/uvd_v1_0.c
>>> b/drivers/gpu/drm/radeon/uvd_v1_0.c index c6b1cbc..35caa89 100644
>>> --- a/drivers/gpu/drm/radeon/uvd_v1_0.c
>>> +++ b/drivers/gpu/drm/radeon/uvd_v1_0.c
>>> @@ -522,7 +522,8 @@ int uvd_v1_0_ib_test(struct radeon_device *rdev,
>>> struct radeon_ring *ring)>
>>> goto error;
>>>
>>> }
>>>
>>> - r = radeon_fence_wait(fence, false);
>>> + r = radeon_fence_wait_timeout(fence, false, msecs_to_jiffies(
>>> + (radeon_lockup_timeout) ? radeon_lockup_timeout :
>>> MAX_SCHEDULE_TIMEOUT));>
>>> if (r) {
>>>
>>> DRM_ERROR("radeon: fence wait failed (%d).\n", r);
>>> goto error;
>
More information about the dri-devel
mailing list