[Mesa-dev] [PATCH 06/10] i965: Add virtual memory allocator infrastructure to brw_bufmgr.
Kenneth Graunke
kenneth at whitecape.org
Mon May 7 22:21:54 UTC 2018
On Saturday, May 5, 2018 12:25:20 AM PDT Chris Wilson wrote:
> Quoting Kenneth Graunke (2018-05-04 02:12:36)
[snip]
> > +static uint64_t
> > +bucket_vma_alloc(struct brw_bufmgr *bufmgr,
> > +                 struct bo_cache_bucket *bucket,
> > +                 enum brw_memory_zone memzone)
> > +{
> > +   struct util_dynarray *vma_list = &bucket->vma_list[memzone];
> > +   struct vma_bucket_node *node;
> > +
> > +   if (vma_list->size == 0) {
> > +      /* This bucket allocator is out of space - allocate a new block of
> > +       * memory for 64 blocks from a larger allocator (either a larger
> > +       * bucket or util_vma).
> > +       *
> > +       * We align the address to the node size (64 blocks) so that
> > +       * bucket_vma_free can easily compute the starting address of this
> > +       * block by rounding any address we return down to the node size.
> > +       *
> > +       * Set the first bit used, and return the start address.
> > +       */
> > +      uint64_t node_size = 64ull * bucket->size;
> > +      node = util_dynarray_grow(vma_list, sizeof(struct vma_bucket_node));
>
>    if (!node)
>       return 0;
>
Fixed, thanks.
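For what it's worth, the bit math on both sides is straightforward. A
rough sketch with stand-in types (illustrative only, not the exact patch
code):

#include <stdint.h>

/* Hypothetical stand-in for the patch's node type; a set bit means the
 * corresponding block is free.
 */
struct vma_bucket_node {
   uint64_t start_address;
   uint64_t bitmap;
};

/* Alloc side: take the first free block from a node.  The caller
 * guarantees the node has at least one free block.
 */
static uint64_t
take_first_free_block(struct vma_bucket_node *node, uint64_t block_size)
{
   int bit = __builtin_ffsll(node->bitmap) - 1;  /* lowest set (free) bit;
                                                    GCC/Clang builtin */
   node->bitmap &= ~(1ull << bit);               /* mark it in-use */
   return node->start_address + bit * block_size;
}

/* Free side: each 64-block group was allocated at an address aligned to
 * the group size, so rounding down recovers the group start, and the
 * offset within the group gives the bit index to set free again.
 */
static void
locate_block(uint64_t address, uint64_t block_size,
             uint64_t *group_start, int *bit)
{
   uint64_t node_size = 64ull * block_size;
   *group_start = (address / node_size) * node_size;
   *bit = (address - *group_start) / block_size;
}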
[snip]
> > +static void
> > +bucket_vma_free(struct bo_cache_bucket *bucket, uint64_t address)
[snip]
> > +   if (!node) {
> > +      /* No node - the whole group of 64 blocks must have been in-use. */
> > +      node = util_dynarray_grow(vma_list, sizeof(struct vma_bucket_node));
>
>    if (!node)
>       vma_free(start, 64 * bucket->size);
That doesn't work. 63 out of 64 of our suballocations from this group
are still in use. This would free the entire block of 64 addresses.
We're trying to record the single suballocation that's now free. If we
can't malloc any space, we can't record it here, and we can't return a
small piece of the memory to the larger allocator (which might be
another bucket).
> > +      node->start_address = start;
> > +      node->bitmap = 0ull;
> > +   }
> > +
> > +   /* Set the bit to return the memory. */
> > +   assert((node->bitmap & (1ull << bit)) == 0ull);
> > +   node->bitmap |= 1ull << bit;
> > +
> > +   /* The block might be entirely free now, and if so, we could return it
> > +    * to the larger allocator. But we may as well hang on to it, in case
> > +    * we get more allocations at this block size.
> > +    */
>
> Return it! Fear the fragmentation, it'll creep up over time. And might be
> a pressing issue for the partitions.
I don't think so. Ultimately, my plan is to lay out the virtual address
space as follows:
- [0, 4K): Nothing (empty page for null address)
- [4K, 4G): Shaders (Instruction Base Address)
- [4G, 8G): Surfaces (Surface State Base Address, Bindless ...)
- [8G, 12G): Dynamic (Dynamic State Base Address)
- [12G, *): Other (everything else in the full 48-bit VMA)
I plan to have a brw_uploader for each zone - one for shader kernels,
one for dynamic state, one for surface states. brw_uploader uses fixed-size
BOs, and makes more BOs at that fixed size if it runs out of space.
That means that we'll be continually cycling through buffers of the same
size...which makes this policy of "there will be more BOs of the same
size soon" pretty reasonable. It's the only size, really.
The large zone is measured in TB, so I'm not worried about it.
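In code, that plan looks roughly like this (zone boundaries as listed
above; the names are illustrative, not necessarily what will land):

#include <stdint.h>

#define _4GB (1ull << 32)

/* Illustrative: one zone per state base address, per the layout above. */
enum brw_memory_zone {
   BRW_MEMZONE_SHADER,   /* [4K, 4G):  Instruction Base Address   */
   BRW_MEMZONE_SURFACE,  /* [4G, 8G):  Surface State Base Address */
   BRW_MEMZONE_DYNAMIC,  /* [8G, 12G): Dynamic State Base Address */
   BRW_MEMZONE_OTHER,    /* [12G, *):  the rest of the 48-bit VMA */
};

static enum brw_memory_zone
memzone_for_address(uint64_t address)
{
   if (address >= 3 * _4GB)
      return BRW_MEMZONE_OTHER;
   if (address >= 2 * _4GB)
      return BRW_MEMZONE_DYNAMIC;
   if (address >= 1 * _4GB)
      return BRW_MEMZONE_SURFACE;
   return BRW_MEMZONE_SHADER;
}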
> > +static struct bo_cache_bucket *
> > +get_bucket_allocator(struct brw_bufmgr *bufmgr, uint64_t size)
>
> get_vma_bucket()
>
> I think it's useful to keep the distinction between allocation of
> physical pages and allocation of GTT offset.
I've been wondering about that, too. We'd have to track a separate
bo->size (size of underlying GEM object) and bo->vma_size (size of
VMA range allocated to it). It also complicates the system quite a
bit, as we might have one bucket for BO size, and a different bucket
for VMA size. Gets messy. Might want to entirely separate them.
On future hardware, we're going to need to place certain buffers at
an aligned memory address...which doesn't fit very well with the
bucket allocator system. Since it fundamentally allocates 64 objects of
size N at a time, and hands back addresses at offsets 0N, 1N, 2N, ...,
N has to be chosen so that every multiple of N is properly aligned. So,
you have to round the size up to the alignment.
That's kind of unfortunate and wasteful...we don't need to actually
allocate that memory in the underlying GEM object. It could bring us
closer to the 4GB pin limit. But then I thought about our current
bucketing, which rounds up the requested size to the bucket, and has
the exact same problem. I guess James Xiong is trying to fix that...
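To put numbers on it (made up for illustration): blocks live at
start + k * stride, with the group start aligned to 64 * stride, so every
block is A-aligned only if the stride itself is a multiple of A:

#include <stdint.h>

/* Round the bucket stride up to the placement alignment.  A 12KB object
 * with a 64KB alignment rule gets a 64KB stride - 52KB of VMA per block
 * that the underlying GEM object never needed.
 */
static uint64_t
aligned_bucket_stride(uint64_t size, uint64_t alignment)
{
   return (size + alignment - 1) / alignment * alignment;
}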
> > +{
> > +   /* Skip using the bucket allocator for very large sizes, as it allocates
> > +    * 64 of them and this can balloon rather quickly.
> > +    */
> > +   if (size > 1024 * PAGE_SIZE)
> > +      return NULL;
>
> Shouldn't this be taking memzone into account? Both for its limit, and
> for the freelist.
I mean, maybe? Given that I plan to use brw_uploader for the small
zones, there will only be a single bucket size in use anyway, so...it
really only matters for the large zone...
> > +   struct bo_cache_bucket *bucket = bucket_for_size(bufmgr, size);
> > +
> > +   if (bucket && bucket->size == size)
>
> It can return the wrong bucket?
No. The bucket allocators can't handle arbitrary sizes; they always
allocate exactly bucket->size. So that had better be the size you're
asking for. If not, bucketing won't work, and we return NULL so that
we fall back to util_vma, which can handle anything.
Normally, bo_alloc_internal rounds up the size to the bucket size
anyway, so we'll find a bucket, and the size will match. However, the
other vma_alloc callers don't do that. For example, if you import a
PRIME buffer that's small enough to hit a BO cache bucket, but not
exactly the size...then we just bail and use util_vma.
I figure that case is uncommon enough that it doesn't matter.
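The rounding it does is roughly this (a sketch; details assumed, not the
exact patch code):

   /* bo_alloc_internal rounds the request up to the bucket size, so the
    * exact-match check above succeeds.  Other vma_alloc callers, like
    * PRIME import, pass the raw size and can fall through to util_vma.
    */
   uint64_t bo_size = bucket ? bucket->size
                             : ALIGN(size, PAGE_SIZE);  /* Mesa's ALIGN macro */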
> > +      return bucket;
> > +
> > +   return NULL;
> > +}
> > +
> > +/** Like vma_alloc, but returns a non-canonicalized address. */
> > +static uint64_t
> > +__vma_alloc(struct brw_bufmgr *bufmgr,
> > +            enum brw_memory_zone memzone,
> > +            uint64_t size,
> > +            uint64_t alignment)
> > +{
> > +   /* Without softpin support, we let the kernel assign addresses. */
> > +   assert(brw_using_softpin(bufmgr));
> > +
> > +   struct bo_cache_bucket *bucket = get_bucket_allocator(bufmgr, size);
> > +   uint64_t addr;
> > +
> > +   if (bucket) {
> > +      addr = bucket_vma_alloc(bufmgr, bucket, memzone);
> > +   } else {
> > +      addr = util_vma_heap_alloc(&bufmgr->vma_allocator[memzone], size,
> > +                                 alignment);
>
> I still think you want a single vma allocator that supports up/down,
> <4GB starts at the bottom, any starts at the top. You can still reserve
> memzones within that and use a second allocator, or freelists, but you
> get the flexibility of having all memory tracked in the top level
> manager, avoiding the worst of the static preassignment of blocks.
I don't see the advantage. Since I hope to have 4 memory zones, a
top/bottom allocator isn't sufficient - it only gives me two. Using
separate 1-dimensional util_vma allocators per zone seems to work fine.
The up/down things-growing-towards-each-other model allows more
flexibility on the meeting point, but that's not useful here. The
state base addresses impose a hard 4GB limit. You can't exceed it.
It would allow the large zone to bleed into the smaller ones, but it's
already so massive by comparison that it's unlikely to ever be useful.
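A sketch of that setup, using the zone boundaries from earlier (names
illustrative; the first page is withheld so 0 stays a null address, as
discussed below):

#include "util/vma.h"

#define PAGE_SIZE 4096
#define _4GB      (1ull << 32)
#define _48B      (1ull << 48)

/* One 1-D util_vma heap per zone; boundaries follow the planned layout
 * above, and page 0 is left out of the shader zone so that address 0
 * is never handed out (see below).
 */
static void
init_zone_allocators(struct util_vma_heap heaps[4])
{
   util_vma_heap_init(&heaps[0], PAGE_SIZE, _4GB - PAGE_SIZE); /* shaders  */
   util_vma_heap_init(&heaps[1], 1 * _4GB, _4GB);              /* surfaces */
   util_vma_heap_init(&heaps[2], 2 * _4GB, _4GB);              /* dynamic  */
   util_vma_heap_init(&heaps[3], 3 * _4GB, _48B - 3 * _4GB);   /* other    */
}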
> But 48b, so who cares? I'd like to see this work for as little as 2GiB
> GTT (Haswell).
So for Haswell, we'd need real PPGTT. Presumably you're thinking you
can make that happen, which would be awesome. At that point, though,
the limit is pinning > 2GB of memory per batch, right? We still have
a 4GB VMA, and could place them anywhere in there. We'd just need to
be careful about wasting space (so rounding up sizes is bad).
My thinking for 32-bit address spaces was to eschew the memory zones,
and just have one 4GB zone for everything. Point all state base
addresses at 0. Until recently, 4GB was all we had on Gen8+ anyway...
> > @@ -360,7 +591,16 @@ retry:
> >        }
> >     }
> >
> > -   if (!alloc_from_cache) {
> > +   if (alloc_from_cache) {
> > +      /* If the cache BO isn't in the right memory zone, free the old
> > +       * memory and assign it a new address.
> > +       */
> > +      if (brw_using_softpin(bufmgr) &&
> > +          memzone != memzone_for_address(bo->gtt_offset)) {
> > +         vma_free(bufmgr, bo->gtt_offset, bo->size);
> > +         bo->gtt_offset = 0ull;
>
> Can we use bo->gtt_offset == -1ull for invalid? And assert later on it's
> never added to the validation objects. But maybe you feel strongly about
> reserving offset 0 as a NULL identifier (just NULL is a valid address as
> well :)
I do feel strongly about 0ull being a null address. Lots of places in
the driver program 0 for "nothing relevant here". For example, when
hull shaders are disabled, we just set 3DSTATE_CONSTANT_HS to 0's.
I figure burning a single page isn't the worst thing in the world,
and it saves us from having to think about that all over the driver.
> > +      }
> > +   } else {
> >        bo = calloc(1, sizeof(*bo));
> >        if (!bo)
> >           goto err;
> > @@ -404,6 +644,13 @@ retry:
> >        goto err_free;
> >     }
> >
> > +   if (brw_using_softpin(bufmgr) && bo->gtt_offset == 0ull) {
> > +      bo->gtt_offset = vma_alloc(bufmgr, memzone, bo->size, 1);
> > +
> > +      if (bo->gtt_offset == 0ull)
> > +         goto err_free;
>
> You could just disable softpin for the bo; the kernel will find room.
> But if you plan to have special memzones, then yes I can understand
> outright rejection if one memzone is full.
> -Chris
Definitely not interested in that. One of the main points of this work
is to move towards a world where relocations don't even exist. While we
can't get rid of them outright, we can certainly not bother with them
for any new features on Gen10+. For example, for indirect clear colors,
we could just write bo->gtt_offset and call it a day, rather than going
through the awkward dance of packing the address with other fields that
exist in SURFACE_STATE, via either WC readbacks or misery to avoid them.
This should also make it possible to pre-bake some packets, instead of
having to re-emit or patch them because the addresses may have changed.
A hybrid approach means we have all of the misery of relocating things,
for a scenario that should hopefully never occur.
--Ken