[RFC] drm/ttm: add minimum residency constraint for bo eviction

Wed Nov 28 13:51:46 PST 2012

I think the problem with Radeon/TTM is much deeper. Let me demonstrate
it with the following example.

Unigine Heaven needs about 385MB of space for static resources, which is
only 75% of my 512MB card. Yet TTM is not capable of getting all of
that into VRAM. If I allow GTT placements, I get 20 fps, which is the
old Mesa behavior. If I force VRAM placements, I get 3 fps, because we
validate buffers 10 times per frame and there are probably a lot of
buffer evictions during each validation.

In theory, we should get the best performance if Radeon/TTM managed to
get everything into VRAM. That's what fglrx probably does. 75% of VRAM
doesn't look like too much. And that's the problem. Even if we
seemingly have enough memory, the current stack is not capable of
using it efficiently.

Marek

On Wed, Nov 28, 2012 at 4:58 PM,  <j.glisse at gmail.com> wrote:
> So I spent the day looking at ttm and eviction. The first patch I sent
> earlier is, I believe, something that should be merged. This patch, however,
> is more about discussing whether other people are interested in a similar
> mechanism being shared among drivers through ttm. Otherwise I could just
> move its logic to the radeon driver.
>
> So the idea of this patch is that we don't want to constantly move objects
> in and out of certain memory pools, mostly VRAM. So it adds a minimum
> residency time, and no object that has been in the given pool for less
> than this residency time can be moved out. It comes close to solving the
> regression we have had with radeon since the gallium driver change, and it
> probably improves some other workloads.
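
The residency rule described above can be sketched roughly as follows. This
is only an illustration, not the code from the patch: struct bo_sketch, its
placed_at_ms field, and bo_may_evict() are hypothetical names, and timestamps
are assumed to be plain milliseconds rather than whatever clock TTM uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffer object; not struct ttm_buffer_object. */
struct bo_sketch {
    uint64_t placed_at_ms;   /* time at which the bo entered its current pool */
};

/*
 * Return true when the bo has been resident in its pool for at least
 * min_residency_ms and may therefore be considered for eviction.
 * A bo that has been resident for less than that must stay where it is.
 */
bool bo_may_evict(const struct bo_sketch *bo, uint64_t now_ms,
                  uint64_t min_residency_ms)
{
    assert(now_ms >= bo->placed_at_ms);   /* assumes a monotonic clock */
    return now_ms - bo->placed_at_ms >= min_residency_ms;
}
```

With the 500ms residency used in the numbers below, a bo placed at t=1000ms
would be skipped by eviction until t=1500ms.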
>
> Statistics I gathered on xonotic/realquake showed that we can have as much
> as 1GB moving in each direction (VRAM to system and system to VRAM) over a
> second. So we are obviously not saturating the PCIE bandwidth. Profiling
> shows that 80-90% of the cost of this eviction is in memory
> allocation/deallocation for the system memory (lots of irqlock, and mostly
> the kernel spending time allocating pages; think 256 000 or more pages per
> second to allocate/deallocate).
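
For reference, the page count quoted above follows from the transfer rate if
one assumes 4KiB pages and reads 1GB as 2^30 bytes (both are assumptions, the
mail does not state them). pages_per_second() is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Number of pages that must be allocated (or freed) per second to back
 * bytes_per_sec of traffic, given page_size-byte pages.
 */
uint64_t pages_per_second(uint64_t bytes_per_sec, uint64_t page_size)
{
    return bytes_per_sec / page_size;
}
```

Under those assumptions, pages_per_second(1ULL << 30, 4096) gives 262144
pages per second for each direction, which matches the "256 000 or more"
figure.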
>
> I used this WIP patch to gather statistics and play with various combinations:
> http://people.freedesktop.org/~glisse/0001-TTM-EVICT-WIP.patch
>
> Some numbers with xonotic:
> 17.369fps stock 3.7 kernel
> 27.883fps 3.7 kernel + do not preserve caching patch ~ +60%
> 49.292fps 3.7 kernel + WIP with 500ms residency for all pools and no bo wait
>           for eviction
> 49.258fps 3.7 kernel + WIP with 500ms residency for all pools and bo wait
> 48.213fps 3.7 kernel always allowing GTT placement (basically reverting the
>           gallium patch effect)
>
> Another design I am thinking of is changing the way radeon handles its
> memory so that it stops trying to revalidate objects to different memory
> pools at each cs. Instead, I think we should keep a VRAM lru list, probably
> per process, and move bos out of VRAM according to this lru and some
> heuristic. So radeon would only move a bo into VRAM when there is room.
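
A minimal sketch of that design, assuming a flat per-process array instead of
a real linked lru list. Everything here (struct lru_bo, the last_use stamp,
vram_lru_victim(), vram_try_place()) is hypothetical and only illustrates the
two rules: eviction picks the least recently used bo, and placement never
evicts to make room.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-process bookkeeping for one buffer object. */
struct lru_bo {
    uint64_t size;       /* size in bytes */
    uint64_t last_use;   /* assumed monotonically increasing use stamp */
    int in_vram;         /* non-zero while the bo lives in VRAM */
};

/* Pick the least recently used bo currently in VRAM, or NULL if none. */
struct lru_bo *vram_lru_victim(struct lru_bo *bos, size_t n)
{
    struct lru_bo *victim = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!bos[i].in_vram)
            continue;
        if (victim == NULL || bos[i].last_use < victim->last_use)
            victim = &bos[i];
    }
    return victim;
}

/*
 * Move a bo into VRAM only when it fits in the remaining free space.
 * Returns 1 if the bo is in VRAM afterwards, 0 if it has to stay out.
 */
int vram_try_place(struct lru_bo *bo, uint64_t *vram_free)
{
    if (bo->in_vram)
        return 1;
    if (bo->size > *vram_free)
        return 0;
    *vram_free -= bo->size;
    bo->in_vram = 1;
    return 1;
}
```

The heuristic that decides when to evict the victim is left out on purpose,
since the mail does not specify it.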
>
> Another improvement I am thinking of is to reuse the GTT memory of objects
> that are moved in for objects that are evicted, as the statistics I gathered
> showed that the amounts moving in and out are often close. But this would
> require true DMA, as it would mean scheduling in/out moves at page
> granularity or in groups of pages (write 4 pages from VRAM to 4 scratch
> pages in sys, write 4 pages of the system memory bo to the 4 VRAM pages,
> write 4 pages of VRAM to the just-moved 4 pages of system memory ...).
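
The interleaved copy order in the parentheses can be modelled in plain C as
below. This is a sketch under strong assumptions: the incoming and outgoing
objects have exactly the same size, memcpy() stands in for a DMA transfer,
and swap_in_chunks() with its parameters is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Swap an outgoing object in `vram` with an incoming object in `sys`,
 * chunk by chunk, using one chunk of `scratch` system memory.
 *
 * Afterwards:
 *   vram               holds the incoming object,
 *   scratch            holds outgoing chunk 0,
 *   sys slot i         holds outgoing chunk i + 1 (for i < nchunks - 1),
 *   last sys slot      is free and can serve as the next scratch area.
 */
void swap_in_chunks(unsigned char *vram, unsigned char *sys,
                    unsigned char *scratch, size_t nchunks, size_t chunk)
{
    if (nchunks == 0)
        return;

    /* Save the first outgoing chunk so its VRAM slot becomes free. */
    memcpy(scratch, vram, chunk);

    for (size_t i = 0; i < nchunks; i++) {
        /* Incoming chunk i goes into the VRAM slot that was just saved. */
        memcpy(vram + i * chunk, sys + i * chunk, chunk);

        /* The next outgoing chunk reuses the system slot just vacated. */
        if (i + 1 < nchunks)
            memcpy(sys + i * chunk, vram + (i + 1) * chunk, chunk);
    }
}
```

Only one extra chunk of system memory is needed regardless of object size,
which is the point of the scheme: no large allocation/deallocation of system
pages per eviction.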
>
> Cheers,
> Jerome