[Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake
Alex Deucher
alexdeucher at gmail.com
Sat Nov 10 08:47:50 PST 2012
On Sat, Nov 10, 2012 at 10:52 AM, Marek Olšák <maraeo at gmail.com> wrote:
> On Fri, Nov 9, 2012 at 9:44 PM, Jerome Glisse <j.glisse at gmail.com> wrote:
>> On Thu, Nov 01, 2012 at 03:13:31AM +0100, Marek Olšák wrote:
>>> On Thu, Nov 1, 2012 at 2:13 AM, Alex Deucher <alexdeucher at gmail.com> wrote:
>>> > On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>> >> The problem was that we set VRAM|GTT for relocations of STATIC resources.
>>> >> Setting just VRAM increases the framerate 4x on my machine.
>>> >>
>>> >> I rewrote the switch statement and adjusted the domains for window
>>> >> framebuffers too.
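
(For context, the domain selection Marek reworked looks roughly like
this -- paraphrased from memory, not the exact diff; the point is that
STATIC/IMMUTABLE resources no longer list GTT as an allowed domain:)

  switch (usage) {
  case PIPE_USAGE_STAGING:
          /* staging buffers are CPU-mapped for transfers, keep in GTT */
          res->domains = RADEON_DOMAIN_GTT;
          break;
  case PIPE_USAGE_DYNAMIC:
  case PIPE_USAGE_STREAM:
          /* frequently CPU-updated, also best left in GTT */
          res->domains = RADEON_DOMAIN_GTT;
          break;
  case PIPE_USAGE_DEFAULT:
  case PIPE_USAGE_STATIC:
  case PIPE_USAGE_IMMUTABLE:
  default:
          /* VRAM only; also listing GTT is what caused the slowdown */
          res->domains = RADEON_DOMAIN_VRAM;
          break;
  }
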
>>> >
>>> > Reviewed-by: Alex Deucher <alexander.deucher at amd.com>
>>> >
>>> > Stable branches?
>>>
>>> Yes, good idea.
>>>
>>> Marek
>>
>> Btw, as a follow-up on this, I did some experiments with TTM and
>> eviction. Blocking any VRAM eviction improves average fps (20-30%)
>> and minimum fps (40-60%), but it diminishes maximum fps (100%).
>> Overall, blocking eviction just makes the framerate more consistent.
>>
>> I then tried several heuristics on the eviction process (not evicting a
>> buffer if it was used in the last 1ms, 10ms, 20ms, ..., sorting the LRU
>> differently between buffers used for rendering and auxiliary buffers
>> used by the kernel, ...). None of those heuristics improved anything.
>> I also removed the bo wait in the eviction pipeline, but still no
>> improvement. I haven't had time to look further, but the bottom line is
>> that some benchmarks are memory-tight and constant eviction hurts.
>>
>> (I used Unigine Heaven and Reaction Quake as benchmarks.)
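
The age-based heuristic Jerome describes would amount to something
like this in the eviction path (entirely hypothetical names, not
actual TTM code; last_use_ns is a made-up field):

  struct radeon_bo {
          uint64_t last_use_ns;   /* hypothetical: last GPU use, in ns */
          /* ... */
  };

  /* Skip buffers the GPU touched within the last N ms. */
  static bool bo_is_evictable(struct radeon_bo *bo, uint64_t now_ns)
  {
          const uint64_t min_idle_ns = 10ull * 1000 * 1000;  /* 10 ms */
          return now_ns - bo->last_use_ns >= min_idle_ns;
  }
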
>
> I've come up with the following solution, which I think would help
> improve the situation a lot.
>
> We should prepare a list of command streams and one list of
> relocations for an entire frame, do buffer validation/placements for
> the entire frame at the beginning, and then just render the whole frame
> (schedule all the command streams at once). That would minimize buffer
> evictions and give us ideal buffer placements for the whole frame, and
> the GPU would run the commands uninterrupted by other processes (and
> we wouldn't have to flush caches so much).
>
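To make that concrete, the per-frame flow would look something like
this (all names are hypothetical, just to illustrate the idea):

  struct frame {
          struct radeon_cs *cs_list[MAX_FRAME_CS]; /* every CS of the frame */
          unsigned num_cs;
          struct reloc_list relocs;                /* one list per frame */
  };

  static void submit_frame(struct frame *f)
  {
          /* validate and place every buffer exactly once, up front */
          validate_and_place(&f->relocs);

          /* then run all command streams back-to-back */
          for (unsigned i = 0; i < f->num_cs; i++)
                  schedule_cs(f->cs_list[i]);

          emit_end_of_frame_fence(f);
  }
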
Another possibility would be to allocate a small number of very large
buffers and then sub-allocate from them in the 3D driver. That should
alleviate some of the overhead of dealing with lots of small buffers in
TTM and also reduce fragmentation.
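
Something like a simple bump sub-allocator over a handful of large BOs
(hypothetical names; alignment is assumed to be a power of two):

  struct slab {
          struct radeon_bo *bo;   /* one large kernel allocation */
          uint64_t offset;        /* current bump pointer */
          uint64_t size;
  };

  static bool slab_alloc(struct slab *s, uint64_t size, uint64_t alignment,
                         struct radeon_bo **out_bo, uint64_t *out_offset)
  {
          uint64_t start = (s->offset + alignment - 1) & ~(alignment - 1);

          if (start + size > s->size)
                  return false;   /* caller moves on to the next slab */

          *out_bo = s->bo;        /* all relocations point at one big BO */
          *out_offset = start;
          s->offset = start + size;
          return true;
  }

TTM would then only ever see the handful of large BOs, so validation,
eviction, and the relocation lists all shrink accordingly.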
Alex
> The only downsides are:
> - Buffers would be marked as "busy" for the entire frame, because the
> fence would only be at the end of the frame. We definitely need more
> fine-grained distribution of fences for apps which map buffers during
> rendering. One possible solution is to let userspace emit fences by
> itself and associate the fences with the buffers in the relocation
> list (see the sketch below). The bo-wait mechanism would then use the
> fence from the (buffer, fence) pair, while TTM would use the
> end-of-frame fence (we can't trust userspace to give us the right
> fences).
> - We should find out how to offload flushing and SwapBuffers to
> another thread, because the final CS ioctl will be really big.
> Currently, the radeon winsys doesn't offload the CS ioctl if it's in
> the SwapBuffers call.
>
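The (buffer, fence) association Marek mentions could be expressed
directly in the relocation entries, something like this (hypothetical
layout, not the actual drm_radeon_cs_reloc struct):

  struct reloc_entry {
          uint32_t handle;        /* BO handle */
          uint32_t read_domains;
          uint32_t write_domain;
          uint32_t flags;
          uint64_t user_fence;    /* sequence number userspace emitted
                                   * after its last use of this BO;
                                   * bo-wait would check this one, while
                                   * TTM keeps using the trusted
                                   * end-of-frame fence */
  };
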
> Possible improvement:
> - Userspace should emit commands into a GPU buffer rather than into
> user memory, so that we don't have to do copy_from_user in the kernel.
> I expect the CS ioctl to unmap the GPU buffer and forbid both later
> mapping and putting the buffer in the relocation list.
>
> Marek
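
Marek's last idea would look roughly like this on the userspace side
(hypothetical winsys calls, just to illustrate):

  /* Emit packets straight into a GPU buffer instead of malloc'ed
   * memory; all names here are made up. */
  struct radeon_bo *cs_bo = bo_create(ws, CS_BUFFER_SIZE, RADEON_DOMAIN_GTT);
  uint32_t *buf = bo_map(cs_bo);

  /* ... emit command packets into buf ... */

  bo_unmap(cs_bo);
  /* the kernel reads the BO directly -- no copy_from_user -- and can
   * refuse any further mapping of it */
  cs_ioctl_from_bo(cs_bo, num_dwords);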