[Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

Marek Olšák maraeo at gmail.com
Mon Nov 12 04:45:17 PST 2012


On Mon, Nov 12, 2012 at 12:23 PM, Christian König
<deathsimple at vodafone.de> wrote:
> On 12.11.2012 11:08, Michel Dänzer wrote:
>>
>> On Sam, 2012-11-10 at 16:52 +0100, Marek Olšák wrote:
>>>
>>> On Fri, Nov 9, 2012 at 9:44 PM, Jerome Glisse <j.glisse at gmail.com> wrote:
>>>>
>>>> On Thu, Nov 01, 2012 at 03:13:31AM +0100, Marek Olšák wrote:
>>>>>
>>>>> On Thu, Nov 1, 2012 at 2:13 AM, Alex Deucher <alexdeucher at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>>>>>>
>>>>>>> The problem was that we set VRAM|GTT for relocations of STATIC resources.
>>>>>>> Setting just VRAM increases the framerate 4x on my machine.
>>>>>>>
>>>>>>> I rewrote the switch statement and adjusted the domains for window
>>>>>>> framebuffers too.
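
For reference, the change boils down to roughly this (a simplified
illustration in winsys terms, not the literal patch):

    /* Pick relocation domains from the buffer's intended usage instead
     * of always allowing VRAM|GTT, so the kernel never bounces static
     * resources into GTT during validation. */
    static unsigned domains_for_usage(unsigned pipe_usage)
    {
        switch (pipe_usage) {
        case PIPE_USAGE_STATIC:
        case PIPE_USAGE_IMMUTABLE:
        case PIPE_USAGE_DEFAULT:
            return RADEON_DOMAIN_VRAM;    /* keep in VRAM only */
        case PIPE_USAGE_STREAM:
        case PIPE_USAGE_STAGING:
            return RADEON_DOMAIN_GTT;     /* CPU-visible, mapped often */
        default:
            return RADEON_DOMAIN_VRAM | RADEON_DOMAIN_GTT;
        }
    }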
>>>>>>
>>>>>> Reviewed-by: Alex Deucher <alexander.deucher at amd.com>
>>>>>>
>>>>>> Stable branches?
>>>>>
>>>>> Yes, good idea.
>>>>>
>>>>> Marek
>>>>
>>>> Btw, as a follow-up on this, I did some experiments with TTM and eviction.
>>>> Blocking all VRAM eviction improves the average fps (20-30%) and the
>>>> minimum fps (40-60%), but it reduces the maximum fps (100%). Overall,
>>>> blocking eviction just makes the framerate more consistent.
>>>>
>>>> I then tried several heuristics in the eviction process (not evicting a
>>>> buffer if it was used in the last 1 ms, 10 ms, 20 ms, ..., sorting the LRU
>>>> differently between buffers used for rendering and auxiliary buffers used
>>>> by the kernel, ...). None of those heuristics improved anything. I also
>>>> removed the bo wait from the eviction path, but still saw no improvement.
>>>> I haven't had time to look further, but the bottom line is that some
>>>> benchmarks are memory-tight and constant eviction hurts.
>>>>
>>>> (I used Unigine Heaven and Reaction Quake as benchmarks.)
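
If I understand the first heuristic correctly, it amounts to a check along
these lines in the eviction path (just a sketch; the last_use_jiffies field
is made up for illustration):

    /* Skip eviction candidates the GPU touched within the last few
     * milliseconds, on the theory that they are about to be used again. */
    static bool bo_recently_used(struct radeon_bo *bo, unsigned window_ms)
    {
        return jiffies_to_msecs(jiffies - bo->last_use_jiffies) < window_ms;
    }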
>>>
>>> I've come up with the following solution, which I think would improve
>>> the situation a lot.
>>>
>>> We should prepare a list of command streams and one list of
>>> relocations for an entire frame, do buffer validation/placements for
>>> the entire frame at the beginning, and then just render the whole frame
>>> (schedule all the command streams at once). That would minimize buffer
>>> evictions, give us the ideal buffer placements for the whole frame, and
>>> let the GPU run the commands uninterrupted by other processes (and we
>>> wouldn't have to flush caches so much).
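
In winsys terms I picture something roughly like this (hypothetical
structures and names, just to make the idea concrete):

    /* Everything needed to validate and submit one frame in one go. */
    struct radeon_frame {
        struct radeon_cs_context *cs_list[MAX_CS_PER_FRAME]; /* all IBs of the frame */
        unsigned                  num_cs;
        struct radeon_bo        **relocs;      /* union of all relocations */
        unsigned                  num_relocs;
    };

    /* At SwapBuffers: validate/place the num_relocs buffers once, then
     * schedule cs_list[0..num_cs-1] back to back with one end-of-frame
     * fence. */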
>>>
>>> The only downsides are:
>>> - Buffers would be marked as "busy" for the entire frame, because the
>>> fence would only be at the end of the frame. We definitely need more
>>> fine-grained distribution of fences for apps which map buffers during
>>> rendering. One possible solution is to let userspace emit fences by
>>> itself and associate the fences with the buffers in the relocation
>>> list. The bo-wait mechanism would then use the fence from the (buffer,
>>> fence) pair, while TTM would use the end-of-frame fence (we can't
>>> trust userspace to give us the right fences).
>>> - We should find out how to offload flushing and SwapBuffers to
>>> another thread, because the final CS ioctl will be really big.
>>> Currently, the radeon winsys doesn't offload the CS ioctl if it's in
>>> the SwapBuffers call.
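
To make the (buffer, fence) pairs from the first point concrete, I'm
thinking of something along these lines (a sketch with hypothetical names):

    /* A relocation entry carrying the userspace-emitted fence that
     * covers the buffer's last use within the frame. */
    struct radeon_reloc_fenced {
        struct radeon_bo *bo;
        uint32_t          read_domains;
        uint32_t          write_domain;
        uint64_t          fence_seq;   /* trusted by bo_wait only, not by TTM */
    };

    /* bo_wait(bo) waits on fence_seq; TTM keeps using the end-of-frame
     * fence for eviction decisions. */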
>>
>> - Deferring to a single big flush like that might introduce additional
>> latency before the GPU starts processing a frame and hurt some apps.
>
>
> Instead of fencing the buffers in userspace, how about something like this
> for the kernel CS interface:
>
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_RELOCS
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_RELOCS
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_RELOCS
> RADEON_CHUNK_ID_FLAGS
>
> Fences are only emitted at RADEON_CHUNK_ID_RELOCS borders, but the whole CS
> call is submitted as one single batch of work, so all BOs get reserved and
> placed at once. That of course doesn't help with the higher latency before
> actually starting a frame, but I don't think that would be such a big
> problem.
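
If I read that right, userspace would build the chunk array roughly like
this (a sketch reusing the existing struct drm_radeon_cs_chunk; the ibN /
relocsN pointers and *_ndw sizes are placeholders, and the interleaved
layout itself is the new part, not something the current kernel accepts):

    struct drm_radeon_cs_chunk chunks[] = {
        { RADEON_CHUNK_ID_IB,     ib0_ndw,     (uint64_t)(uintptr_t)ib0 },
        { RADEON_CHUNK_ID_IB,     ib1_ndw,     (uint64_t)(uintptr_t)ib1 },
        { RADEON_CHUNK_ID_RELOCS, relocs0_ndw, (uint64_t)(uintptr_t)relocs0 },
        { RADEON_CHUNK_ID_IB,     ib2_ndw,     (uint64_t)(uintptr_t)ib2 },
        { RADEON_CHUNK_ID_RELOCS, relocs1_ndw, (uint64_t)(uintptr_t)relocs1 },
        { RADEON_CHUNK_ID_FLAGS,  2,           (uint64_t)(uintptr_t)flags },
    };
    /* The kernel would reserve and place the BOs from every RELOCS chunk
     * up front, but emit a fence only at each RELOCS boundary. */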

The latency can add input lag, which can negatively impact the gaming
experience, especially in first-person shooters.

In the long run I think Radeon/TTM should plan buffer moves across
several frames, i.e. take an incremental approach that eventually
converges to the ideal state: a process generating the highest GPU
load should get much better buffer placements than the idling
processes after something like 10-20 CS ioctls, along with a
guarantee that its buffers won't be evicted anytime soon.
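
Just to illustrate what I mean by incremental (pseudo-kernel code with
made-up names, nothing like this exists today):

    /* Per-process VRAM allowance, nudged a little on every CS ioctl so
     * placements converge over ~10-20 submissions instead of thrashing
     * within a single one. */
    struct vram_budget {
        uint64_t allowance;     /* bytes of VRAM this process may claim */
        uint64_t load_ewma;     /* smoothed GPU load caused by the process, 0..100 */
    };

    static void budget_update(struct vram_budget *b, uint64_t busy_ns,
                              uint64_t frame_ns, uint64_t vram_size)
    {
        uint64_t load = busy_ns * 100 / frame_ns;            /* 0..100 */
        uint64_t target;

        b->load_ewma = (7 * b->load_ewma + load) / 8;        /* slow average */
        target = vram_size * b->load_ewma / 100;             /* desired VRAM share */
        /* Step a fraction of the way toward the target each submission. */
        b->allowance += ((int64_t)target - (int64_t)b->allowance) / 16;
    }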

Also, I think the domains in the relocation list should be ignored
completely and only the initial domain from GEM_CREATE should be taken
into account, because that's the domain the gallium driver wanted in
the first place.
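
In kernel terms that would mean something like this when processing a
relocation (a sketch; initial_domain stands for whatever field ends up
recording the GEM_CREATE domain):

    /* Ignore reloc->read_domains/write_domain and place the buffer
     * according to the domain it was created with. */
    uint32_t domain = bo->initial_domain;   /* saved at DRM_RADEON_GEM_CREATE time */
    radeon_ttm_placement_from_domain(bo, domain);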

Marek

