[PATCH] drm/radeon: add an exclusive lock for GPU reset
Christian König
deathsimple at vodafone.de
Tue Jul 3 02:26:43 PDT 2012
On 02.07.2012 19:27, Jerome Glisse wrote:
> On Mon, Jul 2, 2012 at 1:05 PM, Christian König <deathsimple at vodafone.de> wrote:
>> On 02.07.2012 18:41, Jerome Glisse wrote:
>>> On Mon, Jul 2, 2012 at 12:26 PM, Christian König
>>> <deathsimple at vodafone.de> wrote:
>>>> On 02.07.2012 17:39, j.glisse at gmail.com wrote:
>>>>> From: Jerome Glisse <jglisse at redhat.com>
>>>>>
>>>>> GPU reset need to be exclusive, one happening at a time. For this
>>>>> add a rw semaphore so that any path that trigger GPU activities
>>>>> have to take the semaphore as a reader thus allowing concurency.
>>>>>
>>>>> The GPU reset path take the semaphore as a writer ensuring that
>>>>> no concurrent reset take place.
>>>>>
>>>>> Signed-off-by: Jerome Glisse <jglisse at redhat.com>
>>>> NAK, that isn't as bad as the cs mutex was but still to complicated. It's
>>>> just too far up in the call stack, e.g. it tries to catch ioctl
>>>> operations,
>>>> instead of catching the underlying hardware operation which is caused by
>>>> the
>>>> ioctl/ttm/etc...
>>>>
>>>> Why not just take the ring look as I suggested?
>>>>
>>>>
>>> No we can't use ring lock, we need to protect suspend/resume path and
>>> we need an exclusive lock for that so we need a reset mutex at the
>>> very least. But instead of having a reset mutex i prefer using a rw
>>> lock so that we can stop ioctl until a reset goes through an let
>>> pending ioctl take proper action. Think about multiple context trying
>>> to reset GPU ...
>>>
>>> Really this is the best option, the rw locking wont induce any lock
>>> contention execept in gpu reset case which is anyway bad news.
>> Why? That makes no sense to me. Well I don't want to prevent lock
>> contention, but understand why we should add locking at the ioctl level.
>> That violates locking rule number one "lock data instead of code" (or in our
>> case "lock hardware access instead of code path") and it really is the
>> reason why we ended up with the cs_mutex protecting practically everything.
>>
>> Multiple context trying to reset the GPU should be pretty fine, current it
>> would just reset the GPU twice, but in the future asic_reset should be much
>> more fine grained and only reset the parts of the GPU which really needs an
>> reset.
>>
>> Cheers,
>> Christian.
> No multiple reset is not fine, try it your self and you will see all
> kind of oops (strongly advise you to sync you hd before stress testing
> that). Yes we need to protect code path because suspend/resume code
> path is special one it touch many data in the device structure. GPU
> reset is a rare event or should be a rare event.
Yeah, but I thought that fixing those oops as the second step. I see the
fact that suspend/resume is unpinning all the ttm memory and then
pinning it again as a bug that needs to be fixed. Or as an alternative
we could split the suspend/resume calls into suspend/disable and
resume/enable calls, where we only call disable/enable in the gpu_reset
path rather than a complete suspend/resume (not really sure about that).
Also a GPU reset isn't such a rare event, currently it just occurs when
userspace is doing something bad, for example submitting an invalid
shader or stuff like that. But with VM and IO protection coming into the
driver we are going to need a GPU reset when there is an protection
fault, and with that it is really desirable to just reset the hardware
in a way where we can say: This IB was faulty skip over it and resume
with whatever is after it on the ring.
And todo that we need to keep the auxiliary data like sub allocator
memory, blitting shader bo, and especially vm page tables at the same
place in hardware memory.
> I stress it we need at very least a mutex to protect gpu reset and i
> will stand on that position because there is no other way around.
> Using rw lock have a bonus of allowing proper handling of gpu reset
> failure and that what the patch i sent to linus fix tree is about, so
> to make drm next merge properly while preserving proper behavior in
> gpu reset failure the rw semaphore is the best option.
Oh well, I'm not arguing that we don't need a look here. I'm just
questioning to put it at the ioctl level (e.g. the driver entry from
userspace), that wasn't such a good idea with the cs_mutex and doesn't
seems like a good idea now. Instead we should place the looking between
the ioctl/ttm and the actual hardware submission and that brings it
pretty close (if not identically) to the ring mutex.
Cheers,
Christian.
More information about the dri-devel
mailing list