[Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

Mon Nov 12 02:08:06 PST 2012

On Sat, 2012-11-10 at 16:52 +0100, Marek Olšák wrote:
> On Fri, Nov 9, 2012 at 9:44 PM, Jerome Glisse <j.glisse at gmail.com> wrote:
> > On Thu, Nov 01, 2012 at 03:13:31AM +0100, Marek Olšák wrote:
> >> On Thu, Nov 1, 2012 at 2:13 AM, Alex Deucher <alexdeucher at gmail.com> wrote:
> >> > On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák <maraeo at gmail.com> wrote:
> >> >> The problem was we set VRAM|GTT for relocations of STATIC resources.
> >> >> Setting just VRAM increases the framerate 4 times on my machine.
> >> >>
> >> >> I rewrote the switch statement and adjusted the domains for window
> >> >> framebuffers too.
> >> >
> >> > Reviewed-by: Alex Deucher <alexander.deucher at amd.com>
> >> >
> >> > Stable branches?
> >>
> >> Yes, good idea.
> >>
> >> Marek
> >
> > Btw, as a follow-up on this, I did some experiments with TTM and eviction.
> > Blocking any VRAM eviction improves average fps (20-30%) and minimum fps
> > (40-60%), but it diminishes maximum fps (100%). Overall, blocking eviction
> > just makes the framerate more consistent.
> >
> > I then tried several heuristics on the eviction process (not evicting a
> > buffer if it was used in the last 1ms, 10ms, 20ms, ...; sorting the LRU
> > differently between buffers used for rendering and auxiliary buffers used
> > by the kernel; ...). None of those heuristics improved anything. I also
> > removed the bo wait in the eviction pipeline, but still no improvement.
> > I haven't had time to look further, but the bottom line is that some
> > benchmarks are memory-tight and constant eviction hurts.
> >
> > (I used Unigine Heaven and Reaction Quake as benchmarks.)
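
For readers following along, the "do not evict a buffer used in the last N ms" heuristic described above could be sketched roughly as follows. This is only an illustration, not the code that was actually tested: `struct sketch_bo`, `sketch_pick_victim`, and the millisecond timestamps are hypothetical names and are not part of TTM.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer record for illustration; not a real TTM structure. */
struct sketch_bo {
    uint64_t last_use_ms; /* time the GPU last used this buffer */
};

/*
 * Walk an LRU-ordered array (index 0 = least recently used) and return
 * the index of the first buffer that may be evicted, or -1 if every
 * buffer was used within the grace period.
 * Assumes now_ms >= last_use_ms for every entry.
 */
static int sketch_pick_victim(const struct sketch_bo *lru, size_t count,
                              uint64_t now_ms, uint64_t grace_ms)
{
    for (size_t i = 0; i < count; i++) {
        /* Skip buffers that were used too recently to be worth evicting. */
        if (now_ms - lru[i].last_use_ms < grace_ms)
            continue;
        return (int)i;
    }
    return -1;
}
```

With a grace period of 1, 10, or 20 ms this corresponds to the variants listed in the quoted text; as reported there, none of them improved the measured framerate.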
> 
> I've come up with the following solution, which I think would help
> improve the situation a lot.
> 
> We should prepare a list of command streams and one list of
> relocations for an entire frame, do buffer validation/placements for
> the entire frame at the beginning and then just render the whole frame
> (schedule all the command streams at once). That would minimize the
> buffer evictions and give us the ideal buffer placements for the whole
> frame and then the GPU would run the commands uninterrupted by other
> processes (and we don't have to flush caches so much).
> 
> The only downsides are:
> - Buffers would be marked as "busy" for the entire frame, because the
> fence would only be at the end of the frame. We definitely need more
> fine-grained distribution of fences for apps which map buffers during
> rendering. One possible solution is to let userspace emit fences by
> itself and associate the fences with the buffers in the relocation
> list. The bo-wait mechanism would then use the fence from the (buffer,
> fence) pair, while TTM would use the end-of-frame fence (we can't
> trust userspace to give us the right fences).
> - We should find out how to offload flushing and SwapBuffers to
> another thread, because the final CS ioctl will be really big.
> Currently, the radeon winsys doesn't offload the CS ioctl if it's in
> the SwapBuffers call.

- Deferring to a single big flush like that might introduce additional
latency before the GPU starts processing a frame and hurt some apps.
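
To make the (buffer, fence) pair idea from the quoted list concrete, here is a minimal sketch under stated assumptions. Fences are modelled as monotonically increasing sequence numbers, and all names (`struct sketch_reloc`, `sketch_bo_idle_for_map`, `sketch_bo_idle_for_move`) are hypothetical, not existing radeon or TTM interfaces. Clamping the userspace fence to the end-of-frame fence is my own assumption about how an untrusted value could be bounded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical relocation entry carrying a userspace-emitted fence. */
struct sketch_reloc {
    uint32_t bo_handle;  /* buffer referenced by the frame */
    uint64_t user_fence; /* sequence number emitted by userspace after the
                          * buffer's last use in the frame; 0 = none given */
};

/*
 * bo-wait path (e.g. before mapping a buffer): use the finer-grained
 * userspace fence when one was supplied. The value is untrusted, so it
 * is clamped to the end-of-frame fence (an assumption of this sketch).
 */
static bool sketch_bo_idle_for_map(const struct sketch_reloc *r,
                                   uint64_t frame_fence,
                                   uint64_t signaled_seq)
{
    uint64_t fence = r->user_fence;

    if (fence == 0 || fence > frame_fence)
        fence = frame_fence;
    return signaled_seq >= fence;
}

/*
 * Memory-management path (eviction, moves): ignore the userspace fence
 * and rely only on the end-of-frame fence.
 */
static bool sketch_bo_idle_for_move(uint64_t frame_fence,
                                    uint64_t signaled_seq)
{
    return signaled_seq >= frame_fence;
}
```

In this sketch a buffer whose userspace fence has signalled can be mapped mid-frame, while the memory manager still treats it as busy until the whole frame has completed.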

> Possible improvement:
> - The userspace should emit commands into a GPU buffer and not in the
> user memory, so that we don't have to do copy_from_user in the kernel.
> I expect the CS ioctl to unmap the GPU buffer and forbid later mapping
> as well as putting the buffer in the relocation list.

Unmapping etc. shouldn't be necessary in the long run with GPUVM.

-- 
Earthling Michel Dänzer           |                   http://www.amd.com
Libre software enthusiast         |          Debian, X and DRI developer