[Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

Marek Olšák maraeo at gmail.com
Sat Nov 10 07:52:02 PST 2012


On Fri, Nov 9, 2012 at 9:44 PM, Jerome Glisse <j.glisse at gmail.com> wrote:
> On Thu, Nov 01, 2012 at 03:13:31AM +0100, Marek Olšák wrote:
>> On Thu, Nov 1, 2012 at 2:13 AM, Alex Deucher <alexdeucher at gmail.com> wrote:
>> > On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák <maraeo at gmail.com> wrote:
>> >> The problem was we set VRAM|GTT for relocations of STATIC resources.
>> >> Setting just VRAM increases the framerate 4 times on my machine.
>> >>
>> >> I rewrote the switch statement and adjusted the domains for window
>> >> framebuffers too.
>> >
>> > Reviewed-by: Alex Deucher <alexander.deucher at amd.com>
>> >
>> > Stable branches?
>>
>> Yes, good idea.
>>
>> Marek
>
> Btw as a follow-up on this, I did some experiments with TTM and eviction.
> Blocking any VRAM eviction improves average fps (20-30%) and minimum fps
> (40-60%) but diminishes maximum fps (100%). Overall, blocking eviction
> just makes the framerate more consistent.
>
> I then tried several heuristics in the eviction process (not evicting a
> buffer if it was used in the last 1ms, 10ms, 20ms ..., sorting the LRU
> differently between buffers used for rendering and auxiliary buffers used
> by the kernel, ...). None of those heuristics improved anything. I also
> removed the bo-wait in the eviction pipeline, but still no improvement.
> I haven't had time to look further, but anyway, the bottom line is that
> some benchmarks are memory-tight and constant eviction hurts.
>
> (used Unigine Heaven and Reaction Quake for benchmarks)

I've come up with the following solution, which I think would help
improve the situation a lot.

We should prepare a list of command streams and one list of
relocations for an entire frame, do buffer validation/placements for
the entire frame at the beginning, and then just render the whole frame
(schedule all the command streams at once). That would minimize
buffer evictions and give us the ideal buffer placements for the whole
frame, and the GPU would run the commands uninterrupted by other
processes (and we wouldn't have to flush caches so much).

The only downsides are:
- Buffers would be marked as "busy" for the entire frame, because the
fence would only be at the end of the frame. We definitely need a more
fine-grained distribution of fences for apps which map buffers during
rendering. One possible solution is to let userspace emit fences by
itself and associate the fences with the buffers in the relocation
list. The bo-wait mechanism would then use the fence from the (buffer,
fence) pair, while TTM would use the end-of-frame fence (we can't
trust userspace to give us the right fences).
- We should find out how to offload flushing and SwapBuffers to
another thread, because the final CS ioctl will be really big.
Currently, the radeon winsys doesn't offload the CS ioctl if it's in
the SwapBuffers call.

Possible improvement:
- Userspace should emit commands into a GPU buffer rather than into
user memory, so that we don't have to do copy_from_user in the kernel.
I expect the CS ioctl to unmap the GPU buffer and forbid later mapping,
as well as putting the buffer in the relocation list.

Marek
