[Mesa-dev] [PATCH v2] i965 : optimized bucket index calculation

Tue Oct 24 09:42:55 UTC 2017

> -----Original Message-----
> From: Ian Romanick [mailto:idr at freedesktop.org]
> Sent: Friday, October 20, 2017 9:51 AM
> To: Muthukumar, Aravindan <aravindan.muthukumar at intel.com>; mesa-
> dev at lists.freedesktop.org
> Cc: J Karanje, Kedar <kedar.j.karanje at intel.com>; Tamminen, Eero T
> <eero.t.tamminen at intel.com>
> Subject: Re: [Mesa-dev] [PATCH v2] i965 : optimized bucket index calculation
> 
> On 09/13/2017 11:43 PM, aravindan.muthukumar at intel.com wrote:
> > From: Aravindan Muthukumar <aravindan.muthukumar at intel.com>
> >
> > Avoiding the loop which was running with O(n) complexity.
> > Now the complexity has been reduced to O(1)
> >
> > Algorithm calculates the index using matrix method.
> > Matrix arrangement is as below:
> > Assuming PAGE_SIZE is 4096.
> >
> >          1*4096   2*4096    3*4096    4*4096
> >          5*4096   6*4096    7*4096    8*4096
> 
> I think adding one more row to this chart would make it more clear.  The two
> rows shown also follow a simpler pattern, and that made some of the
> complexity below seem confusing.
> 
>            10*4096  12*4096   14*4096   16*4096
> 
> >           ...      ...       ...       ...
> >           ...      ...       ...       ...
> >           ...      ...       ...   max_cache_size
> >
> > From this matrix its cleary seen that every row
>                        clearly
> 
> > follows the below way:
> >          ...       ...       ...        n
> >        n+(1/4)n  n+(1/2)n  n+(3/4)n    2n
> >
> > Row is calulated as log2(size/PAGE_SIZE) Column is calculated as
> > converting the difference between the elements to fit into power size
> > of two and indexing it.
> >
> > Final Index is (row*4)+(col-1)
> >
> > Tested with Intel Mesa CI.
> >
> > Improves performance of 3d Mark on Broxton.
> > Analyzed using Compare Perf Analyser:
> > Average : 201.2 +/- 65.4836 (n=20)
> 
> Is 201 the improvement or the absolute score?  Do not quote absolute scores.
> 
> > Percentage : 0.705966% +/- 0.229767% (n=20)
> 
> Eero: Can you reproduce this result on BXT or other platforms?  Just curious...
> 
> > v2: Review comments regarding cosmetics and asserts implemented
> >
> > Signed-off-by: Aravindan Muthukumar <aravindan.muthukumar at intel.com>
> > Signed-off-by: Kedar Karanje <kedar.j.karanje at intel.com>
> > Reviewed-by: Yogesh Marathe <yogesh.marathe at intel.com>
> > ---
> >  src/mesa/drivers/dri/i965/brw_bufmgr.c | 46
> > ++++++++++++++++++++++++++++------
> >  1 file changed, 39 insertions(+), 7 deletions(-)
> >
> > diff --git a/src/mesa/drivers/dri/i965/brw_bufmgr.c
> > b/src/mesa/drivers/dri/i965/brw_bufmgr.c
> > index 8017219..8013ccb 100644
> > --- a/src/mesa/drivers/dri/i965/brw_bufmgr.c
> > +++ b/src/mesa/drivers/dri/i965/brw_bufmgr.c
> > @@ -87,6 +87,8 @@
> >
> >  #define memclear(s) memset(&s, 0, sizeof(s))
> >
> > +#define PAGE_SIZE 4096
> > +
> >  #define FILE_DEBUG_FLAG DEBUG_BUFMGR
> >
> >  static inline int
> > @@ -181,19 +183,45 @@ bo_tile_pitch(struct brw_bufmgr *bufmgr, uint32_t
> pitch, uint32_t tiling)
> >     return ALIGN(pitch, tile_width);
> >  }
> >
> > +static inline int
> > +ilog2_round_up(int value)
> > +{
> > +   assert(value != 0);
> > +   return 32 - __builtin_clz(value - 1); }
> > +
> > +/*
> > + * This function finds the correct bucket fit for the input size.
> > + * The function works with O(1) complexity when the requested size
> > + * was queried instead of iterating the size through all the buckets.
> > + */
> >  static struct bo_cache_bucket *
> >  bucket_for_size(struct brw_bufmgr *bufmgr, uint64_t size)  {
> > -   int i;
> > +   int index = -1;
> 
> I don't see how execution could get to that test without index being set again,
> so it should be safe to remove the initialization.
> 

We will skip this initialization since it doesn’t impact the index result.

> > +   int row, col = 0;
> > +   int pages, pages_log2;
> 
> Move the declarations of row, col, pages, and pages_log2 into the else case,
> initialize them with the only assignment to them, and make them const.
> 
> Since all of these values, including index, get values derived from size, I believe
> the should all be unsigned.  In that case, remove the index >= 0 test below.
> 

We will move as well as initialize the above variables in the respective else part. 

> >
> > -   for (i = 0; i < bufmgr->num_buckets; i++) {
> > -      struct bo_cache_bucket *bucket = &bufmgr->cache_bucket[i];
> > -      if (bucket->size >= size) {
> > -         return bucket;
> > -      }
> > +   /* condition for size less  than 4*4096 (16KB) page size */
>          ^                       ^ Remove one space here.
>          |
>          Capitalize and add a period at the end of the sentence.
> 
> > +   if(size <= 4 * PAGE_SIZE) {
>         ^ Add a space here.
> 
> > +      index = DIV_ROUND_UP(size, PAGE_SIZE) - 1;;
>                   Remove extra trailing semicolon. ^
> 
> > +   } else {
> > +      /* Number of pages of page size */
> > +      pages = DIV_ROUND_UP(size, PAGE_SIZE);
> > +      pages_log2 = ilog2_round_up(pages) - 1;
> 
> Isn't this going to be the same result as _mesa_logbase2(pages)?
>   

_mesa_logbase2 and ilog2_round_up uses the same builtin_clz inbuilt function.

> > +
> > +      /* Finding the row and column of the matrix */
> > +      row = pages_log2 - 1;
> > +      col = DIV_ROUND_UP((pages - (1 << pages_log2)),
> > +            (1 << (pages_log2 - 2)));
>                ^            ^
>                This should  |
>                be aligned   |
>                here. -------+
> 
> Removing some of the extra parenthesis would also make it easier to read.
> 
> So... I feel like this function doesn't need to special case size < 16k like this.  I
> haven't benchmarked it, but the following is about 30 bytes shorter and avoids
> the conditional branch.  I also think it's easier to understand.
> 
>    const unsigned pages = DIV_ROUND_UP(size, PAGE_SIZE);
> 
>    /* Row  Bucket sizes    clz((x-1) | 3)   Row    Column
>     *        in pages                      stride   size
>     *   0:   1  2  3  4 -> 30 30 30 30        4       1
>     *   1:   5  6  7  8 -> 29 29 29 29        4       1
>     *   2:  10 12 14 16 -> 28 28 28 28        8       2
>     *   3:  20 24 28 32 -> 27 27 27 27       16       4
>     */
>    const unsigned row = 30 - __builtin_clz((pages - 1) | 3);
>    const unsigned row_max_pages = 4 << row;
> 
>    /* The '& ~2' is the special case for row 1.  In row 1, max pages /
>     * 2 is 2, but the previous row maximum is zero (because there is
>     * no previous row).  All row maximum sizes are power of 2, so that
>     * is the only case where that bit will be set.
>     */
>    const unsigned prev_row_max_pages = (row_max_pages / 2) & ~2;
>    const unsigned col_size = (row_max_pages - prev_row_max_pages) / 4;
> 
>    const unsigned col = DIV_ROUND_UP(pages - prev_row_max_pages,
>                                      col_size);
> 
>    /* Using the calculated row and column to index into the matrix */
>    const unsigned index = (row * 4) + (col - 1);
> 
> The second DIV_ROUND_UP annoys me.  col_size is a power of two, so we
> should be able to just shift.  My attempts to take advantage of that yielded piles
> of extra instructions and a conditional branch.  The problem was the row == 0
> case.  The best I got was:
> 
>    int col_size_log2 = row - 1;
>    col_size_log2 += (col_size_log2 < 0);
> 
>    const unsigned col = (pages - prev_row_max_pages +
>                          ((1 << col_size_log2) - 1)) >> col_size_log2;
> 
> I also found open-coding the first DIV_ROUND_UP as (size + PAGE_SIZE -
> 1) / PAGE_SIZE generated very slightly better code.  That saves about 8 bytes of
> code... no idea if it's faster or not, but it did avoid a dependency on the flags.
> 
> You should try my different versions of this routine in your benchmark.
> 
> > +
> > +      /* Using the calculated row and column to index into the matrix */
> > +      index = (row << 2) + (col - 1);
> >     }
> >
> > -   return NULL;
> > +   /* Checking the error condition */
> 

During the intial optimization we tried similar things, but however the above logic doesn’t seems to be the standard way of finding the row and column. 
 In our algorithm, we find the row and Column of the matrix using arithmetic and geometric progression. Since the initialization of cache buckets follow the progressive(arithmetic and geometric) way,
we used the same way. 

In any logic, for finding the index for row==0, we needed to do some extra operations.

> Assuming the index >= 0 check is removed, per my suggestion above, I don't feel
> like this is a very useful comment.  If index is >= num_buckets, then the size is
> too large?  The cache bucket hasn't been created yet?
> 
> > +   return (index >= 0 && index < bufmgr->num_buckets) ?
> > +          (&bufmgr->cache_bucket[index]) : NULL;
>              ^                            ^
>              Remove these extraneous parenthesis.
> 
> >  }

Earlier, if the input size is greater or an invalid size, the original  code was returning NULL, the same behavior is maintained whenever the input size or calculated index is invalid.

> >
> >  int
> > @@ -1239,6 +1267,10 @@ add_bucket(struct brw_bufmgr *bufmgr, int size)
> >     list_inithead(&bufmgr->cache_bucket[i].head);
> >     bufmgr->cache_bucket[i].size = size;
> >     bufmgr->num_buckets++;
> > +
> > +   assert(bucket_for_size(bufmgr, size) == &bufmgr->cache_bucket[i]);
> > +   assert(bucket_for_size(bufmgr, size - 2048) == &bufmgr->cache_bucket[i]);
> > +   assert(bucket_for_size(bufmgr, size + 1) !=
> > + &bufmgr->cache_bucket[i]);
> >  }
> >
> >  static void
> >