[Mesa-dev] [PATCH v2] i965 : optimized bucket index calculation

Thu Nov 2 07:55:26 UTC 2017

> > >On 10/24/2017 02:42 AM, Muthukumar, Aravindan wrote:
> > >>> -----Original Message-----
> > >>> From: Ian Romanick [mailto:idr at freedesktop.org]
> > >>> Sent: Friday, October 20, 2017 9:51 AM
> > >>> To: Muthukumar, Aravindan <aravindan.muthukumar at intel.com>; mesa-
> > >>> dev at lists.freedesktop.org
> > >>> Cc: J Karanje, Kedar <kedar.j.karanje at intel.com>; Tamminen, Eero T
> > >>> <eero.t.tamminen at intel.com>
> > >>> Subject: Re: [Mesa-dev] [PATCH v2] i965 : optimized bucket index
> > >>> calculation
> > >>>
> > >>> On 09/13/2017 11:43 PM, aravindan.muthukumar at intel.com wrote:
> > >>>> From: Aravindan Muthukumar <aravindan.muthukumar at intel.com>
> > >>>>
> > >>>> Avoiding the loop which was running with O(n) complexity.
> > >>>> Now the complexity has been reduced to O(1)
> > >>>>
> > >>>> Algorithm calculates the index using matrix method.
> > >>>> Matrix arrangement is as below:
> > >>>> Assuming PAGE_SIZE is 4096.
> > >>>>
> > >>>>          1*4096   2*4096    3*4096    4*4096
> > >>>>          5*4096   6*4096    7*4096    8*4096
> > >>>
> > >>> I think adding one more row to this chart would make it more clear.
> > >>> The two rows shown also follow a simpler pattern, and that made
> > >>> some of the complexity below seem confusing.
> > >>>
> > >>>            10*4096  12*4096   14*4096   16*4096
> > >>>
> > >>>>           ...      ...       ...       ...
> > >>>>           ...      ...       ...       ...
> > >>>>           ...      ...       ...   max_cache_size
> > >>>>
> > >>>> From this matrix its cleary seen that every row
> > >>>                        clearly
> > >>>
> > >>>> follows the below way:
> > >>>>          ...       ...       ...        n
> > >>>>        n+(1/4)n  n+(1/2)n  n+(3/4)n    2n
> > >>>>
> > >>>> Row is calulated as log2(size/PAGE_SIZE) Column is calculated as
> > >>>> converting the difference between the elements to fit into power
> > >>>> size of two and indexing it.
> > >>>>
> > >>>> Final Index is (row*4)+(col-1)
> > >>>>
> > >>>> Tested with Intel Mesa CI.
> > >>>>
> > >>>> Improves performance of 3d Mark on Broxton.
> > >>>> Analyzed using Compare Perf Analyser:
> > >>>> Average : 201.2 +/- 65.4836 (n=20)
> > >>>
> > >>> Is 201 the improvement or the absolute score?  Do not quote
> > >>> absolute
> > scores.
> > >>>
> > >>>> Percentage : 0.705966% +/- 0.229767% (n=20)
> > >>>
> > >>> Eero: Can you reproduce this result on BXT or other platforms?
> > >>> Just
> > curious...
> > >>>
> > >>>> v2: Review comments regarding cosmetics and asserts implemented
> > >>>>
> > >>>> Signed-off-by: Aravindan Muthukumar
> > >>>> <aravindan.muthukumar at intel.com>
> > >>>> Signed-off-by: Kedar Karanje <kedar.j.karanje at intel.com>
> > >>>> Reviewed-by: Yogesh Marathe <yogesh.marathe at intel.com>
> > >>>> ---
> > >>>>  src/mesa/drivers/dri/i965/brw_bufmgr.c | 46
> > >>>> ++++++++++++++++++++++++++++------
> > >>>>  1 file changed, 39 insertions(+), 7 deletions(-)
> > >>>>
> > >>>> diff --git a/src/mesa/drivers/dri/i965/brw_bufmgr.c
> > >>>> b/src/mesa/drivers/dri/i965/brw_bufmgr.c
> > >>>> index 8017219..8013ccb 100644
> > >>>> --- a/src/mesa/drivers/dri/i965/brw_bufmgr.c
> > >>>> +++ b/src/mesa/drivers/dri/i965/brw_bufmgr.c
> > >>>> @@ -87,6 +87,8 @@
> > >>>>
> > >>>>  #define memclear(s) memset(&s, 0, sizeof(s))
> > >>>>
> > >>>> +#define PAGE_SIZE 4096
> > >>>> +
> > >>>>  #define FILE_DEBUG_FLAG DEBUG_BUFMGR
> > >>>>
> > >>>>  static inline int
> > >>>> @@ -181,19 +183,45 @@ bo_tile_pitch(struct brw_bufmgr *bufmgr,
> > >>>> uint32_t
> > >>> pitch, uint32_t tiling)
> > >>>>     return ALIGN(pitch, tile_width);  }
> > >>>>
> > >>>> +static inline int
> > >>>> +ilog2_round_up(int value)
> > >>>> +{
> > >>>> +   assert(value != 0);
> > >>>> +   return 32 - __builtin_clz(value - 1); }
> > >>>> +
> > >>>> +/*
> > >>>> + * This function finds the correct bucket fit for the input size.
> > >>>> + * The function works with O(1) complexity when the requested
> > >>>> +size
> > >>>> + * was queried instead of iterating the size through all the buckets.
> > >>>> + */
> > >>>>  static struct bo_cache_bucket *
> > >>>>  bucket_for_size(struct brw_bufmgr *bufmgr, uint64_t size)  {
> > >>>> -   int i;
> > >>>> +   int index = -1;
> > >>>
> > >>> I don't see how execution could get to that test without index
> > >>> being set again, so it should be safe to remove the initialization.
> > >>>
> > >>
> > >> We will skip this initialization since it doesn’t impact the index result.
> > >>
> > >>
> > >>>> +   int row, col = 0;
> > >>>> +   int pages, pages_log2;
> > >>>
> > >>> Move the declarations of row, col, pages, and pages_log2 into the
> > >>> else case, initialize them with the only assignment to them, and
> > >>> make them
> > >const.
> > >>>
> > >>> Since all of these values, including index, get values derived
> > >>> from size, I believe the should all be unsigned.  In that case,
> > >>> remove the index >= 0
> > >test below.
> > >>>
> > >>
> > >> We will move as well as initialize the above variables in the
> > >> respective else
> > part.
> > >>
> > >>>>
> > >>>> -   for (i = 0; i < bufmgr->num_buckets; i++) {
> > >>>> -      struct bo_cache_bucket *bucket = &bufmgr->cache_bucket[i];
> > >>>> -      if (bucket->size >= size) {
> > >>>> -         return bucket;
> > >>>> -      }
> > >>>> +   /* condition for size less  than 4*4096 (16KB) page size */
> > >>>          ^                       ^ Remove one space here.
> > >>>          |
> > >>>          Capitalize and add a period at the end of the sentence.
> > >>>
> > >>>> +   if(size <= 4 * PAGE_SIZE) {
> > >>>         ^ Add a space here.
> > >>>
> > >>>> +      index = DIV_ROUND_UP(size, PAGE_SIZE) - 1;;
> > >>>                   Remove extra trailing semicolon. ^
> > >>>
> > >>>> +   } else {
> > >>>> +      /* Number of pages of page size */
> > >>>> +      pages = DIV_ROUND_UP(size, PAGE_SIZE);
> > >>>> +      pages_log2 = ilog2_round_up(pages) - 1;
> > >>>
> > >>> Isn't this going to be the same result as _mesa_logbase2(pages)?
> > >>>
> > >>
> > >> _mesa_logbase2 and ilog2_round_up uses the same builtin_clz inbuilt
> > function.
> > >
> > >What I think I hear you saying is, "There's no reason to add a new
> > >function that does the same thing an existing function does."  I can
> > >agree with
> > that.
> > >
> > >>>> +
> > >>>> +      /* Finding the row and column of the matrix */
> > >>>> +      row = pages_log2 - 1;
> > >>>> +      col = DIV_ROUND_UP((pages - (1 << pages_log2)),
> > >>>> +            (1 << (pages_log2 - 2)));
> > >>>                ^            ^
> > >>>                This should  |
> > >>>                be aligned   |
> > >>>                here. -------+
> > >>>
> > >>> Removing some of the extra parenthesis would also make it easier to
> read.
> > >>>
> > >>> So... I feel like this function doesn't need to special case size
> > >>> < 16k like this.  I haven't benchmarked it, but the following is
> > >>> about
> > >>> 30 bytes shorter and avoids the conditional branch.  I also think
> > >>> it's easier to
> > >understand.
> > >>>
> > >>>    const unsigned pages = DIV_ROUND_UP(size, PAGE_SIZE);
> > >>>
> > >>>    /* Row  Bucket sizes    clz((x-1) | 3)   Row    Column
> > >>>     *        in pages                      stride   size
> > >>>     *   0:   1  2  3  4 -> 30 30 30 30        4       1
> > >>>     *   1:   5  6  7  8 -> 29 29 29 29        4       1
> > >>>     *   2:  10 12 14 16 -> 28 28 28 28        8       2
> > >>>     *   3:  20 24 28 32 -> 27 27 27 27       16       4
> > >>>     */
> > >>>    const unsigned row = 30 - __builtin_clz((pages - 1) | 3);
> > >>>    const unsigned row_max_pages = 4 << row;
> > >>>
> > >>>    /* The '& ~2' is the special case for row 1.  In row 1, max pages /
> > >>>     * 2 is 2, but the previous row maximum is zero (because there is
> > >>>     * no previous row).  All row maximum sizes are power of 2, so that
> > >>>     * is the only case where that bit will be set.
> > >>>     */
> > >>>    const unsigned prev_row_max_pages = (row_max_pages / 2) & ~2;
> > >>>    const unsigned col_size = (row_max_pages - prev_row_max_pages)
> > >>> / 4;
> > >>>
> > >>>    const unsigned col = DIV_ROUND_UP(pages - prev_row_max_pages,
> > >>>                                      col_size);
> > >>>
> > >>>    /* Using the calculated row and column to index into the matrix */
> > >>>    const unsigned index = (row * 4) + (col - 1);
> > >>>
> > >>> The second DIV_ROUND_UP annoys me.  col_size is a power of two, so
> > >>> we should be able to just shift.  My attempts to take advantage of
> > >>> that yielded piles of extra instructions and a conditional branch.
> > >>> The problem was the row == 0 case.  The best I got was:
> > >>>
> > >>>    int col_size_log2 = row - 1;
> > >>>    col_size_log2 += (col_size_log2 < 0);
> > >>>
> > >>>    const unsigned col = (pages - prev_row_max_pages +
> > >>>                          ((1 << col_size_log2) - 1)) >>
> > >>> col_size_log2;
> > >>>
> > >>> I also found open-coding the first DIV_ROUND_UP as (size +
> > >>> PAGE_SIZE
> > >>> -
> > >>> 1) / PAGE_SIZE generated very slightly better code.  That saves
> > >>> about
> > >>> 8 bytes of code... no idea if it's faster or not, but it did avoid
> > >>> a dependency on
> > >the flags.
> > >>>
> > >>> You should try my different versions of this routine in your benchmark.
> > >>>
> > >>>> +
> > >>>> +      /* Using the calculated row and column to index into the matrix */
> > >>>> +      index = (row << 2) + (col - 1);
> > >>>>     }
> > >>>>
> > >>>> -   return NULL;
> > >>>> +   /* Checking the error condition */
> > >>>
> > >>
> > >> During the intial optimization we tried similar things, but however
> > >> the above
> > >logic doesn’t seems to be the standard way of finding the row and column.
> > >>  In our algorithm, we find the row and Column of the matrix using
> > >> arithmetic and geometric progression. Since the initialization of
> > >> cache buckets
> > >follow the progressive(arithmetic and geometric) way, we used the same
> way.
> > >>
> > >> In any logic, for finding the index for row==0, we needed to do
> > >> some extra
> > >operations.
> > >
> > >The whole purpose of this change is to improve performance.  The
> > >alternate calculation that I proposed generates fewer instructions.
> > >The code is clear, concise, and understandable.  If we're going to
> > >invest the effort to optimize something, we should do that to the
> > >fullest extent possible without making jeopardizing the future
> > >maintainability of
> > the code.
> > >

Thanks Ian. I did some quick profiling on the changes suggested above where we see the second optimized version consumes very less cycles and generates less instructions 
compared to the other versions as well the changes in the original patch. I will update the patch with the suggested changes after completing the  conformance tests for submission.

> > >>> Assuming the index >= 0 check is removed, per my suggestion above,
> > >>> I don't feel like this is a very useful comment.  If index is >=
> > >>> num_buckets, then the size is too large?  The cache bucket hasn't
> > >>> been
> > >created yet?
> > >>>
> > >>>> +   return (index >= 0 && index < bufmgr->num_buckets) ?
> > >>>> +          (&bufmgr->cache_bucket[index]) : NULL;
> > >>>              ^                            ^
> > >>>              Remove these extraneous parenthesis.
> > >>>
> > >>>>  }
> > >>
> > >> Earlier, if the input size is greater or an invalid size, the
> > >> original  code was
> > >returning NULL, the same behavior is maintained whenever the input
> > >size or calculated index is invalid.
> > >
> > >The checks are fine.  I was pointing out that "Checking the error
> > >condition" is not a useful commentary of those checks.
> > >
> > >>>>
> > >>>>  int
> > >>>> @@ -1239,6 +1267,10 @@ add_bucket(struct brw_bufmgr *bufmgr, int
> > size)
> > >>>>     list_inithead(&bufmgr->cache_bucket[i].head);
> > >>>>     bufmgr->cache_bucket[i].size = size;
> > >>>>     bufmgr->num_buckets++;
> > >>>> +
> > >>>> +   assert(bucket_for_size(bufmgr, size) == &bufmgr->cache_bucket[i]);
> > >>>> +   assert(bucket_for_size(bufmgr, size - 2048) == &bufmgr-
> > >>cache_bucket[i]);
> > >>>> +   assert(bucket_for_size(bufmgr, size + 1) !=
> > >>>> + &bufmgr->cache_bucket[i]);
> > >>>>  }
> > >>>>
> > >>>>  static void
> > >>>>
> > >
> > >_______________________________________________
> > >mesa-dev mailing list
> > >mesa-dev at lists.freedesktop.org
> > >https://lists.freedesktop.org/mailman/listinfo/mesa-dev