[Mesa-dev] [PATCH v2] i965 : optimized bucket index calculation

Fri Oct 20 04:20:44 UTC 2017

On 09/13/2017 11:43 PM, aravindan.muthukumar at intel.com wrote:
> From: Aravindan Muthukumar <aravindan.muthukumar at intel.com>
> 
> Avoiding the loop which was running with O(n) complexity.
> Now the complexity has been reduced to O(1)
> 
> Algorithm calculates the index using matrix method.
> Matrix arrangement is as below:
> Assuming PAGE_SIZE is 4096.
> 
>          1*4096   2*4096    3*4096    4*4096
>          5*4096   6*4096    7*4096    8*4096

I think adding one more row to this chart would make it more clear.  The
two rows shown also follow a simpler pattern, and that made some of the
complexity below seem confusing.

           10*4096  12*4096   14*4096   16*4096

>           ...      ...       ...       ...
>           ...      ...       ...       ...
>           ...      ...       ...   max_cache_size
> 
> From this matrix its cleary seen that every row
                       clearly

> follows the below way:
>          ...       ...       ...        n
>        n+(1/4)n  n+(1/2)n  n+(3/4)n    2n
> 
> Row is calulated as log2(size/PAGE_SIZE)
> Column is calculated as converting the difference
> between the elements to fit into power size of two
> and indexing it.
> 
> Final Index is (row*4)+(col-1)
> 
> Tested with Intel Mesa CI.
> 
> Improves performance of 3d Mark on Broxton.
> Analyzed using Compare Perf Analyser:
> Average : 201.2 +/- 65.4836 (n=20)

Is 201 the improvement or the absolute score?  Do not quote absolute scores.

> Percentage : 0.705966% +/- 0.229767% (n=20)

Eero: Can you reproduce this result on BXT or other platforms?  Just
curious...

> v2: Review comments regarding cosmetics and asserts implemented
> 
> Signed-off-by: Aravindan Muthukumar <aravindan.muthukumar at intel.com>
> Signed-off-by: Kedar Karanje <kedar.j.karanje at intel.com>
> Reviewed-by: Yogesh Marathe <yogesh.marathe at intel.com>
> ---
>  src/mesa/drivers/dri/i965/brw_bufmgr.c | 46 ++++++++++++++++++++++++++++------
>  1 file changed, 39 insertions(+), 7 deletions(-)
> 
> diff --git a/src/mesa/drivers/dri/i965/brw_bufmgr.c b/src/mesa/drivers/dri/i965/brw_bufmgr.c
> index 8017219..8013ccb 100644
> --- a/src/mesa/drivers/dri/i965/brw_bufmgr.c
> +++ b/src/mesa/drivers/dri/i965/brw_bufmgr.c
> @@ -87,6 +87,8 @@
>  
>  #define memclear(s) memset(&s, 0, sizeof(s))
>  
> +#define PAGE_SIZE 4096
> +
>  #define FILE_DEBUG_FLAG DEBUG_BUFMGR
>  
>  static inline int
> @@ -181,19 +183,45 @@ bo_tile_pitch(struct brw_bufmgr *bufmgr, uint32_t pitch, uint32_t tiling)
>     return ALIGN(pitch, tile_width);
>  }
>  
> +static inline int
> +ilog2_round_up(int value)
> +{
> +   assert(value != 0);
> +   return 32 - __builtin_clz(value - 1);
> +}
> +
> +/*
> + * This function finds the correct bucket fit for the input size.
> + * The function works with O(1) complexity when the requested size
> + * was queried instead of iterating the size through all the buckets.
> + */
>  static struct bo_cache_bucket *
>  bucket_for_size(struct brw_bufmgr *bufmgr, uint64_t size)
>  {
> -   int i;
> +   int index = -1;

I don't see how execution could get to that test without index being set
again, so it should be safe to remove the initialization.

> +   int row, col = 0;
> +   int pages, pages_log2;

Move the declarations of row, col, pages, and pages_log2 into the else
case, initialize them with the only assignment to them, and make them const.

Since all of these values, including index, get values derived from
size, I believe the should all be unsigned.  In that case, remove the
index >= 0 test below.

>  
> -   for (i = 0; i < bufmgr->num_buckets; i++) {
> -      struct bo_cache_bucket *bucket = &bufmgr->cache_bucket[i];
> -      if (bucket->size >= size) {
> -         return bucket;
> -      }
> +   /* condition for size less  than 4*4096 (16KB) page size */
         ^                       ^ Remove one space here.
         |
         Capitalize and add a period at the end of the sentence.

> +   if(size <= 4 * PAGE_SIZE) {
        ^ Add a space here.

> +      index = DIV_ROUND_UP(size, PAGE_SIZE) - 1;;
                  Remove extra trailing semicolon. ^

> +   } else {
> +      /* Number of pages of page size */
> +      pages = DIV_ROUND_UP(size, PAGE_SIZE);
> +      pages_log2 = ilog2_round_up(pages) - 1;

Isn't this going to be the same result as _mesa_logbase2(pages)?

> +
> +      /* Finding the row and column of the matrix */
> +      row = pages_log2 - 1;
> +      col = DIV_ROUND_UP((pages - (1 << pages_log2)),
> +            (1 << (pages_log2 - 2)));
               ^            ^
               This should  |
               be aligned   |
               here. -------+

Removing some of the extra parenthesis would also make it easier to read.

So... I feel like this function doesn't need to special case size < 16k
like this.  I haven't benchmarked it, but the following is about 30
bytes shorter and avoids the conditional branch.  I also think it's
easier to understand.

   const unsigned pages = DIV_ROUND_UP(size, PAGE_SIZE);

   /* Row  Bucket sizes    clz((x-1) | 3)   Row    Column
    *        in pages                      stride   size
    *   0:   1  2  3  4 -> 30 30 30 30        4       1
    *   1:   5  6  7  8 -> 29 29 29 29        4       1
    *   2:  10 12 14 16 -> 28 28 28 28        8       2
    *   3:  20 24 28 32 -> 27 27 27 27       16       4
    */
   const unsigned row = 30 - __builtin_clz((pages - 1) | 3);
   const unsigned row_max_pages = 4 << row;

   /* The '& ~2' is the special case for row 1.  In row 1, max pages /
    * 2 is 2, but the previous row maximum is zero (because there is
    * no previous row).  All row maximum sizes are power of 2, so that
    * is the only case where that bit will be set.
    */
   const unsigned prev_row_max_pages = (row_max_pages / 2) & ~2;
   const unsigned col_size = (row_max_pages - prev_row_max_pages) / 4;

   const unsigned col = DIV_ROUND_UP(pages - prev_row_max_pages,
                                     col_size);

   /* Using the calculated row and column to index into the matrix */
   const unsigned index = (row * 4) + (col - 1);

The second DIV_ROUND_UP annoys me.  col_size is a power of two, so we
should be able to just shift.  My attempts to take advantage of that
yielded piles of extra instructions and a conditional branch.  The
problem was the row == 0 case.  The best I got was:

   int col_size_log2 = row - 1;
   col_size_log2 += (col_size_log2 < 0);

   const unsigned col = (pages - prev_row_max_pages +
                         ((1 << col_size_log2) - 1)) >> col_size_log2;

I also found open-coding the first DIV_ROUND_UP as (size + PAGE_SIZE -
1) / PAGE_SIZE generated very slightly better code.  That saves about 8
bytes of code... no idea if it's faster or not, but it did avoid a
dependency on the flags.

You should try my different versions of this routine in your benchmark.

> +
> +      /* Using the calculated row and column to index into the matrix */
> +      index = (row << 2) + (col - 1);
>     }
>  
> -   return NULL;
> +   /* Checking the error condition */

Assuming the index >= 0 check is removed, per my suggestion above, I
don't feel like this is a very useful comment.  If index is >=
num_buckets, then the size is too large?  The cache bucket hasn't been
created yet?

> +   return (index >= 0 && index < bufmgr->num_buckets) ?
> +          (&bufmgr->cache_bucket[index]) : NULL;
             ^                            ^
             Remove these extraneous parenthesis.

>  }
>  
>  int
> @@ -1239,6 +1267,10 @@ add_bucket(struct brw_bufmgr *bufmgr, int size)
>     list_inithead(&bufmgr->cache_bucket[i].head);
>     bufmgr->cache_bucket[i].size = size;
>     bufmgr->num_buckets++;
> +
> +   assert(bucket_for_size(bufmgr, size) == &bufmgr->cache_bucket[i]);
> +   assert(bucket_for_size(bufmgr, size - 2048) == &bufmgr->cache_bucket[i]);
> +   assert(bucket_for_size(bufmgr, size + 1) != &bufmgr->cache_bucket[i]);
>  }
>  
>  static void
>