RFC: Radeon multi ring support branch

Tue Nov 15 16:24:17 PST 2011

2011/11/15 Christian König <deathsimple at vodafone.de>:
> On 15.11.2011 20:32, Jerome Glisse wrote:
>>
>> On Sat, Oct 29, 2011 at 03:00:28PM +0200, Christian König wrote:
>>>
>>> Hello everybody,
>>>
>>> to support multiple compute rings, async DMA engines and UVD we need
>>> to teach the radeon kernel module how to sync buffers between
>>> different rings and make some changes to the command submission
>>> ioctls.
>>>
>>> Since we can't release any documentation about async DMA or UVD
>>> (yet), my current branch concentrates on getting the additional
>>> compute rings on cayman running. Unfortunately those rings have
>>> hardware bugs that can't be worked around, so they are actually not
>>> very useful in a production environment, but they should do quite
>>> well for this testing purpose.
>>>
>>> The branch can be found here:
>>> http://cgit.freedesktop.org/~deathsimple/linux/log/
>>>
>>> Since some of the patches are quite intrusive, constantly rebaseing
>>> them could get a bit painful. So I would like to see most of the
>>> stuff included into drm-next, even if we don't make use of the new
>>> functionality right now.
>>>
>>> Comments welcome,
>>> Christian.
>>
>> So i have been looking back at all this and now there is somethings
>> puzzling me. So if semaphore wait for a non null value at gpu address
>> provided in the packet than the current implementation for the cs
>> ioctl doesn't work when there is more than 2 rings to sync.
>
> Semaphores are counters, so each signal operation is atomically incrementing
> the counter, while each wait operation is (atomically) checking if the
> counter is above zero and if it is decrement it. So you can signal/wait on
> the same address multiple times.

Yup i did understand that right.

>>
>> As it will use only one semaphore so first ring to finish will
>> mark the semaphore as done even if there is still other ring not
>> done.
>
> Nope, each wait operation will wait for a separate signal operation (at
> least I think so).

Well as we don't specify on which value semaphore should wait on, i am
prety sure the first ring to increment the semaphore will unblock all
waiter. So if you have ring1 that want to wait on ring2 and ring3 as
soon as ring2 or ring3 is done ring1 will go one while either ring2 or
ring3 might not be done. I will test that tomorrow but from doc i have
it seems so. Thus it will be broken with more than one ring, that
would mean you have to allocate one semaphore for each ring couple you
want to synchronize. Note that the usual case will likely be sync btw
2 ring.

>>
>> This all make me wonder if some change to cs ioctl would make
>> all this better. So idea of semaphore is to wait for some other
>> ring to finish something. So let say we have following scenario:
>> Application submit following to ring1: csA, csB
>> Application now submit to ring2: cs1, cs2
>> And application want csA to be done for cs1 and csB to be done
>> for cs2.
>>
>> To achieve such usage pattern we would need to return fence seq
>> or similar from the cs ioctl. So user application would know
>> ringid+fence_seq for csA&  csB and provide this when scheduling
>> cs1&  cs2. Here i am assuming MEM_WRITE/WAIT_REG_MEM packet
>> are as good as MEM_SEMAPHORE packet. Ie the semaphore packet
>> doesn't give us much more than MEM_WRITE/WAIT_REG_MEM would.
>>
>> To achieve that each ring got it's fence scratch addr where to
>> write seq number. And then we use WAIT_REG_MEM on this addr
>> and with the specific seq for the other ring that needs
>> synchronization. This would simplify the semaphore code as
>> we wouldn't need somethings new beside helper function and
>> maybe extending the fence structure.
>
> I played around with the same Idea before implementing the whole semaphore
> stuff, but the killer argument against it is that not all rings support the
> WAIT_REG_MEM command.
>
> Also the semaphores are much more efficient than the WAIT_REG_MEM command,
> because all semaphore commands from the different rings are send to a
> central semaphore block, so that constant polling, and with it constant
> memory access can be avoided. Additional to that the WAIT_REG_MEM command
> has a minimum latency of Wait_Interval*16 clock cycles, while semaphore need
> 4 clock cycles for the command and 4 clock cycles for the result, so it
> definitely has a much lower latency.

Yup that was my guess too that semaphore block optimize things

> We should also keep in mind that the semaphore block is not only capable to
> sync between different rings inside a single GPU, but can also communicate
> with another semaphore block in a second GPU. So it is a key part in a multi
> GPU environment.
>
>> Anyway i put updating ring patch at :
>> http://people.freedesktop.org/~glisse/mrings/
>> It rebased on top of linus tree and it has several space
>> indentation fixes and also a fix for no semaphore allocated
>> issue (patch 5)
>>
> Thanks, I will try to integrate the changes tomorrow.
>

After retesting the first patch  drm/radeon: fix debugfs handling is NAK
a complete no go.

Issue is that radeon_debugfs_cleanup is call after rdev is free. This
is why i used a static array. I forgot about that, i should have put a
comment. I guess you built your kernel without debugfs or that you
didn't tested to reload the module.

Cheers,
Jerome