[Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

Christian König deathsimple at vodafone.de
Mon Nov 12 03:23:58 PST 2012


On 12.11.2012 11:08, Michel Dänzer wrote:
> On Sam, 2012-11-10 at 16:52 +0100, Marek Olšák wrote:
>> On Fri, Nov 9, 2012 at 9:44 PM, Jerome Glisse <j.glisse at gmail.com> wrote:
>>> On Thu, Nov 01, 2012 at 03:13:31AM +0100, Marek Olšák wrote:
>>>> On Thu, Nov 1, 2012 at 2:13 AM, Alex Deucher <alexdeucher at gmail.com> wrote:
>>>>> On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>>>>> The problem was that we set VRAM|GTT for relocations of STATIC resources.
>>>>>> Setting just VRAM increases the framerate 4 times on my machine.
>>>>>>
>>>>>> I rewrote the switch statement and adjusted the domains for window
>>>>>> framebuffers too.
>>>>> Reviewed-by: Alex Deucher <alexander.deucher at amd.com>
>>>>>
>>>>> Stable branches?
>>>> Yes, good idea.
>>>>
>>>> Marek
>>> Btw, as a follow-up on this, I did some experiments with TTM and eviction.
>>> Blocking any VRAM eviction improves average fps (20-30%) and minimum fps
>>> (40-60%), but it diminishes maximum fps (100%). Overall, blocking eviction
>>> just makes the framerate more consistent.
>>>
>>> I then tried several heuristics on the eviction process (not evicting a
>>> buffer if it was used in the last 1 ms, 10 ms, 20 ms, ..., sorting the
>>> LRU differently between buffers used for rendering and auxiliary buffers
>>> used by the kernel, ...); none of those heuristics improved anything. I
>>> also removed the bo wait in the eviction pipeline, but still no
>>> improvement. I haven't had time to look further, but the bottom line is
>>> that some benchmarks are memory-tight and constant eviction hurts.
>>>
>>> (I used Unigine Heaven and Reaction Quake as benchmarks.)
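
The time-based heuristic described above would amount to roughly the
following check in the eviction path (a minimal sketch, not the actual
TTM code; the sketch_bo struct and its last_use_jiffies field are made
up for illustration):

    #include <linux/jiffies.h>

    struct sketch_bo {
            unsigned long last_use_jiffies; /* jiffies at last GPU use */
            /* ... */
    };

    /* Only allow eviction of buffers the GPU has not touched within the
     * last grace_ms milliseconds, so hot buffers are not bounced out of
     * VRAM just to be paged right back in. */
    static bool sketch_bo_may_evict(const struct sketch_bo *bo,
                                    unsigned int grace_ms)
    {
            return time_after(jiffies,
                              bo->last_use_jiffies +
                              msecs_to_jiffies(grace_ms));
    }
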
>> I've come up with the following solution, which I think would help
>> improve the situation a lot.
>>
>> We should prepare a list of command streams and one list of
>> relocations for an entire frame, do buffer validation/placements for
>> the entire frame at the beginning and then just render the whole frame
>> (schedule all the command streams at once). That would minimize buffer
>> evictions, give us the ideal buffer placements for the whole frame, and
>> let the GPU run the commands uninterrupted by other processes (and we
>> wouldn't have to flush caches so much).
>>
>> The only downsides are:
>> - Buffers would be marked as "busy" for the entire frame, because the
>> fence would only be at the end of the frame. We definitely need more
>> fine-grained distribution of fences for apps which map buffers during
>> rendering. One possible solution is to let userspace emit fences by
>> itself and associate the fences with the buffers in the relocation
>> list. The bo-wait mechanism would then use the fence from the (buffer,
>> fence) pair, while TTM would use the end-of-frame fence (we can't
>> trust userspace to give us the right fences).
>> - We should find out how to offload flushing and SwapBuffers to
>> another thread, because the final CS ioctl will be really big.
>> Currently, the radeon winsys doesn't offload the CS ioctl if it's in
>> the SwapBuffers call.
> - Deferring to a single big flush like that might introduce additional
> latency before the GPU starts processing a frame and hurt some apps.
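
Marek's (buffer, fence) pairing would mean extending each relocation
entry with a fence reference, along these lines (purely illustrative;
the fence_seq field does not exist in the current interface):

    #include <stdint.h>

    struct sketch_reloc_with_fence {
            uint32_t handle;        /* BO handle */
            uint32_t read_domains;
            uint32_t write_domain;
            uint32_t flags;
            uint32_t fence_seq;     /* hypothetical: last userspace fence
                                     * touching this BO; bo-wait would
                                     * check this one, TTM would use the
                                     * end-of-frame fence */
    };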

Instead of fencing the buffers in userspace, how about something like 
this for the kernel CS interface:

RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_RELOCS
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_RELOCS
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_RELOCS
RADEON_CHUNK_ID_FLAGS

Fences are only emitted at RADEON_CHUNK_ID_RELOCS borders, but the whole 
CS call is submitted as one single chunk of work, so all BOs get 
reserved and placed at once. That of course doesn't help with the higher 
latency before actually starting a frame, but I don't think that would 
be such a big problem.
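
Encoded with the chunk structures radeon_drm.h already defines, such a
submission might look like the following (a sketch only: the ib*/relocs*/
flags pointers and *_ndw lengths are placeholders, and today's kernel
expects just one IB chunk per CS call):

    #include <stdint.h>
    #include <drm/radeon_drm.h>

    /* One fence per group of IBs, emitted at each RELOCS border; the
     * FLAGS chunk closes the call. */
    struct drm_radeon_cs_chunk chunks[] = {
            { .chunk_id = RADEON_CHUNK_ID_IB,
              .length_dw = ib0_ndw,     .chunk_data = (uintptr_t)ib0 },
            { .chunk_id = RADEON_CHUNK_ID_IB,
              .length_dw = ib1_ndw,     .chunk_data = (uintptr_t)ib1 },
            { .chunk_id = RADEON_CHUNK_ID_RELOCS,
              .length_dw = relocs0_ndw, .chunk_data = (uintptr_t)relocs0 },
            /* ... further IB/RELOCS groups ... */
            { .chunk_id = RADEON_CHUNK_ID_FLAGS,
              .length_dw = flags_ndw,   .chunk_data = (uintptr_t)flags },
    };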

>> Possible improvement:
>> - Userspace should emit commands into a GPU buffer rather than into
>> user memory, so that we don't have to do copy_from_user in the kernel.
>> I expect the CS ioctl to unmap the GPU buffer and forbid later mapping,
>> as well as putting the buffer in the relocation list.
> Unmapping etc. shouldn't be necessary in the long run with GPUVM.
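
Building the IB in place as Marek describes might look roughly like
this (a sketch; buffer_map/buffer_unmap are illustrative names, not the
actual radeon winsys entry points):

    /* Emit packets directly into a CPU-mapped GTT buffer instead of
     * malloc'ed memory, so the kernel never has to copy_from_user the
     * command stream. */
    uint32_t *ib = buffer_map(ib_bo);   /* CPU mapping of the IB BO */
    ib[ndw++] = PKT3(PKT3_NOP, 0, 0);   /* emit packets in place */
    /* ... rest of the frame's commands ... */
    buffer_unmap(ib_bo);                /* unmap before the CS ioctl */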

We already have patches in internal review that allow userspace to 
submit IBs without any CS checking, which avoids the whole 
copy_from_user and checking overhead. So don't worry too much about 
this problem.

Christian.


