[Mesa-dev] [PATCH v3 07/18] anv/allocator: Add a BO cache

Mon Apr 3 22:45:49 UTC 2017

On Mon 03 Apr 2017, Jason Ekstrand wrote:
> On Mon, Apr 3, 2017 at 12:31 PM, Chad Versace <chadversary at chromium.org>
> wrote:
> 
> > On Fri 31 Mar 2017, Chad Versace wrote:
> > > On Wed 15 Mar 2017, Jason Ekstrand wrote:
> > > > This cache allows us to easily ensure that we have a unique anv_bo for
> > > > each gem handle.  We'll need this in order to support multiple-import
> > of
> > > > memory objects and semaphores.
> > > >
> > > > v2 (Jason Ekstrand):
> > > >  - Reject BO imports if the size doesn't match the prime fd size as
> > > >    reported by lseek().
> > > >
> > > > v3 (Jason Ekstrand):
> > > >  - Fix reference counting around cache_release (Chris Willson)
> > > >  - Move the mutex_unlock() later in cache_release
> > > > ---
> > > >  src/intel/vulkan/anv_allocator.c | 261 ++++++++++++++++++++++++++++++
> > +++++++++
> > > >  src/intel/vulkan/anv_private.h   |  26 ++++
> > > >  2 files changed, 287 insertions(+)
> > >
> > >
> > > > +static uint32_t
> > > > +hash_uint32_t(const void *key)
> > > > +{
> > > > +   return (uint32_t)(uintptr_t)key;
> > > > +}
> > >
> > > This hash function does not appear hashy.
> > >
> > > If I correctly understand the details of Mesa's struct hash_table,
> > > choosing the identify function for the hash function causes unwanted
> > > clustering when inserting consecutive gem handles.  Since the kernel does
> > > allocate gem handles consecutively, the problem is real.
> > >
> > > For proof, consider the following:
> > >
> > >    - Suppose a long-running process (maybe the compositor) has thrashed
> > on the
> > >      hash table long enough that its bucket count
> > >      is ht->size = hash_sizes[7].size = 283. Suppose a spike of
> > >      compositor activity raises the hash table's density to about 0.5.
> > >      And suppose the hash table buckets are filled with the consecutive
> > gem
> > >      handles
> > >
> > >      {0, 0, 0, 0, 4, 5, 6, 7, 8, 9, ..., 127, 128, 0, 0, 0, ..., 0 }
> > >
> > >      The exact density is (128 - 4 + 1) / 283 = 0.4417.
> > >
> > >    - Next, some other in-process activity (maybe OpenGL) generated
> > >      a lot of gem handles after Vulkan's most recently imported
> > >      gem handle, 128.
> >
> > This point in the example---the reason why the gem handles in the
> > anv_bo_cache skip from 128 to 287---is bogus in Vulkan. The problem *is*
> > real for multiple in-process OpenGL contexts derived from the same
> > EGLDisplay, using EGL_EXT_image_dma_buf_import, because each context
> > shares the same intel_screen, and therefore the same drm device fd. But
> > in Vulkan, each VkDevice opens its own drm device id. So, bogus example.
> >
> > BUT, that leads to a new question...
> >
> > Since each VkDevice has a unique drm device fd, and since the kernel
> > allocates gem handles consecutively on the fd, and since struct
> > hash_table only grows and never shrinks, and since patch 8/18 inserts
> > every VkDeviceMemory into the cache... I believe no collisions are
> > possible in anv_bo_cache.
> >
> 
> Does this fall under the category of unbreakable kernel ABI or is it just a
> side-effect of the implementation?  If not, then I'm reluctant to trust it.

I'm not certain. But krh, sitting beside me, says "It's ABI at this point.
The kernel uses a 'idr' structure which guarantees that behavior".

> > If there are no collisions, then the hash table is only adding overhead,
> 
> Sure, but a no-collision hash table is pretty cheap...
> 
> > and we should use a direct-addressing lookup table. The bo cache should
> > look like this:
> >
> >     struct anv_bo_cache {
> >        /* The array indices are gem handles. Null entries are legal. */
> >        struct anv_bo **bos;
> >
> >        /* Length of the array. Because the array can have holes, this
> >         * is *not* the number of gem handles in the array.
> >         */
> >        size_t len;
> >
> >        pthread_mutex_t mutex;
> >     };
> >
> >     struct anv_bo *
> >     anv_bo_cache_lookup(struct anv_bo_cache *cache, uint32_t gem_handle)
> >     {
> >        struct anv_bo *bo = NULL;
> >
> >        pthread_mutex_lock(&cache->mutex);
> >
> >        if (gem_handle < cache->len)
> >           bo = cache->entries[gem_handle];
> >
> >        pthread_mutex_unlock(&cache->mutex);
> >
> >        return bo;
> >
> >     }
> >
> > BUT, that leads to yet another question...
> >
> > Why is patch 8/18 inserting every VkDeviceMemory into the cache? If
> > I understand things correctly, we only *need* to insert a VkDeviceMemory
> > into the cache if, in vkAllocateMemory, either (1)
> > VkExportMemoryAllocateInfoKHX::handleTypes != 0 or (2)
> > VkMemoryAllocateInfo's pNext chain contains an import structure.
> >
> 
> Because I'm lazy.  In order to start using the bo cache,
> anv_device_memory::bo needs to be a pointer (well, it's makes the BO cache
> API simpler and more efficient if it's a pointer).  This would mean that we
> would have to allocate an additional chunk of memory or go through some
> other hoops in order to make it work.  At the end of the day, just stuffing
> everything in the cache was simpler and kept us to a single path.

Huh? I don't understand how pointers-or-not-pointers affect the decision
to use an array or a hash table.  In the lookup function I wrote above,
the anv_bo_cache holds pointers to anv_bo's, just like the anv_bo_cache
in the actual patch. (I kept it that way so the API would be the same as
in your patch series).

> > If we insert into the cache only those VkDeviceMemories that are
> > imported or that will be exported, then the bo cache remains small, and
> > we *should* use a hash table.
> >
> 
> Maybe.  But the client isn't supposed to be allocating hundreds of
> VkDeviceMemory objects.  It's supposed to allocate a few and then
> suballocate from those.  If the client allocates so many memory objects
> that they start hitting hash table performance issues, that's their own
> fault.
> 
> Also, please remember that vkAllocateMemory is considered to be a *very*
> heavy-weight function in Vulkan.  Compared to an ioctl which allocates
> memory, a hash table insert is trivial.  I'm reasonably happy to make a few
> changes here or there to make it more efficient if any of this proves to be
> a problem.  However, I think we're working way too hard to micro-optimize
> something that takes < 0.001% of runtime.

True. The difference between time needed for an array lookup and a hash
table lookup is tiny compared to the ioctl. I wasn't suggesting to use
an array for that 0.001% though. I suggested an array because struct
hash_table, as used in this patch, is just a resizable array with an
unneeded vtable. It just felt weird.

We've discussed all the things :) I'll drop the complaint about
hash-vs-array.