[PATCH 0/9] make struct drm_mm_node embeddable

Mon Nov 15 11:45:12 PST 2010

Hi Thomas,

On Mon, Nov 15, 2010 at 08:58:13AM +0100, Thomas Hellstrom wrote:
> Nice work, although I have some comments about general applicability
> that we perhaps need to think about.
> 
> 1) The space representation and space allocation algorithm is
> something that is private to the aperture management system. For a
> specialized implementation like i915 that is all fine, but Ben has
> recently abstracted that part out of the core TTM bo implementation.
> As an example, vmwgfx is now using kernel idas to manage aperture
> space, and drm_mm objects for traditional VRAM space. Hence,
> embedding drm_mm objects into ttm bos will not really be worthwile.
> At least not for aperture space management, and TTM will need to
> continue to "dance", both in the ida case and in the drm_mm case.

Yep, I've looked into this and noticed the recent addition of the ida
support. This is why I've added the "decent surgery" comment. Embedding
the drm_mm_node still looks possible, albeit perhaps not feasible (at
least I won't tackle this in the immediate future).

> For device address space, the situation is different, though, and it
> should be possible to embed the drm_mm objects, but that brings up
> the next thing:
> 
> 2) The algorithm used by drm_mm has been around for a while and has
> seen a fair amount of changes, but nobody has yet attacked the
> algorithm used to search for free space, which was just quickly put
> together as an improvement on what was the old mesa range manager.
> In moderate fragmentation situations, the performance will degrade,
> particularly with "best match" searches. In the near future we'd
> probably want to add something like a "hole rb tree" rather than a
> "hole list", and a choice of algorithm for the user. With embeddable
> objects, unless you want to waste space for unused members, you'd
> need a separate drm_mm node subclass for each algorithm, whereas if
> you don't embed, you only need to allocate what you need.

First a small rant about "best match" (to get it out of the way;-)
- "best match" is simply a misleading name: with alignment > size
  (at least on older hw) and mixes of unrestricted and range restricted
  allocations (ironlake has 2G of gtt, just 256 of it mappable), which is
  all possible with the latest experimental i915 patches, "best match" can
  do worse than the simpler approach.
- doing a full linear scan for every tiny state buffer/pixmap cache is
  slow.
At this, it serves as an excuse to not implement proper eviction support.
</rant>
[If you agree, I'll happily write the patch to rip it out. It just doesn't
bother me 'cause it's only a few lines in drm_mm.c and I can ignore the
actual users.]

Now to the more useful discussion: IMHO drm_mm.c should be an allocator
for vram/(g)tt, i.e. it needs to support:
- a mix of large/small sizes.
- fancy alignment constrains (new patches for drm/i915 are pushing things
  there).
- range-restricted allocations. I think current users only ever have one
  (start, end) set for restricted allocations, so this might actually be
  simplified.
If other users don't fit into this anymore, mea culpa, they need they're
own allocator. You've already taken this path for vmwgfx by using the ida
allocator. And if the linear scan for the gem mmap offset allocator ever
shows up in profiles, I think it's better served with a buddy-style
pot-sized, pot-aligned allocator. After all, fragmentation of virtual
address space isn't a that severe problem.

Hence I think that drivers with extremely specific needs should roll their
own allocator. So I don't think we should anticipate different allocator
algorithms. I see driver-specific stuff more in the area of clever
eviction algorithms - i915 is currently at 5 lru's for gtt mapped bos, and
we're still adding.

Of course I've spent a bunch of brain-cycles on creating a more efficient
allocator - O(n) just doesn't look that good. Now
- it should be fast in the common case
- and not degerate into O(n) for ugly corner cases.
Which leaves us for the above allocation requirements of (u64 size, u32
alignment, bool range_restricted) with two 2d-range-trees. Now factoring
in that lru-scanning is also O(n_{gtt_mapped}) gives us a data-structure
I'm not really eager to create.

Current code seems fares rather well because the hole_stack fifo is good
at avoiding the linear scan worst-case. And as soon as we start to strash
the gtt, everything is totally snowed under by clflush overhead on i915
anyway.

To make a long story short, I've opted to make the current code faster by
avoiding kmalloc and spoiling fewer cache-lines with useless data. And if
the linear scan ever shows up in profiles, we could always add some stats
to bail out early for large allocations. Or add a tree to heuristically
find a suitable hole (assuming worst-case waste due to alignment).

Thanks a lot for your input on this.

Yours, Daniel
-- 
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48