[Mesa-dev] [PATCH v2] i965 : optimized bucket index calculation
Ian Romanick
idr at freedesktop.org
Fri Oct 20 04:20:44 UTC 2017
On 09/13/2017 11:43 PM, aravindan.muthukumar at intel.com wrote:
> From: Aravindan Muthukumar <aravindan.muthukumar at intel.com>
>
> Avoiding the loop which was running with O(n) complexity.
> Now the complexity has been reduced to O(1)
>
> Algorithm calculates the index using matrix method.
> Matrix arrangement is as below:
> Assuming PAGE_SIZE is 4096.
>
> 1*4096 2*4096 3*4096 4*4096
> 5*4096 6*4096 7*4096 8*4096
I think adding one more row to this chart would make it more clear. The
two rows shown also follow a simpler pattern, and that made some of the
complexity below seem confusing.
10*4096 12*4096 14*4096 16*4096
> ... ... ... ...
> ... ... ... ...
> ... ... ... max_cache_size
>
> From this matrix its cleary seen that every row
clearly
> follows the below way:
> ... ... ... n
> n+(1/4)n n+(1/2)n n+(3/4)n 2n
>
> Row is calulated as log2(size/PAGE_SIZE)
> Column is calculated as converting the difference
> between the elements to fit into power size of two
> and indexing it.
>
> Final Index is (row*4)+(col-1)
>
> Tested with Intel Mesa CI.
>
> Improves performance of 3d Mark on Broxton.
> Analyzed using Compare Perf Analyser:
> Average : 201.2 +/- 65.4836 (n=20)
Is 201 the improvement or the absolute score? Do not quote absolute scores.
> Percentage : 0.705966% +/- 0.229767% (n=20)
Eero: Can you reproduce this result on BXT or other platforms? Just
curious...
> v2: Review comments regarding cosmetics and asserts implemented
>
> Signed-off-by: Aravindan Muthukumar <aravindan.muthukumar at intel.com>
> Signed-off-by: Kedar Karanje <kedar.j.karanje at intel.com>
> Reviewed-by: Yogesh Marathe <yogesh.marathe at intel.com>
> ---
> src/mesa/drivers/dri/i965/brw_bufmgr.c | 46 ++++++++++++++++++++++++++++------
> 1 file changed, 39 insertions(+), 7 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_bufmgr.c b/src/mesa/drivers/dri/i965/brw_bufmgr.c
> index 8017219..8013ccb 100644
> --- a/src/mesa/drivers/dri/i965/brw_bufmgr.c
> +++ b/src/mesa/drivers/dri/i965/brw_bufmgr.c
> @@ -87,6 +87,8 @@
>
> #define memclear(s) memset(&s, 0, sizeof(s))
>
> +#define PAGE_SIZE 4096
> +
> #define FILE_DEBUG_FLAG DEBUG_BUFMGR
>
> static inline int
> @@ -181,19 +183,45 @@ bo_tile_pitch(struct brw_bufmgr *bufmgr, uint32_t pitch, uint32_t tiling)
> return ALIGN(pitch, tile_width);
> }
>
> +static inline int
> +ilog2_round_up(int value)
> +{
> + assert(value != 0);
> + return 32 - __builtin_clz(value - 1);
> +}
> +
> +/*
> + * This function finds the correct bucket fit for the input size.
> + * The function works with O(1) complexity when the requested size
> + * was queried instead of iterating the size through all the buckets.
> + */
> static struct bo_cache_bucket *
> bucket_for_size(struct brw_bufmgr *bufmgr, uint64_t size)
> {
> - int i;
> + int index = -1;
I don't see how execution could get to that test without index being set
again, so it should be safe to remove the initialization.
> + int row, col = 0;
> + int pages, pages_log2;
Move the declarations of row, col, pages, and pages_log2 into the else
case, initialize them with the only assignment to them, and make them const.
Since all of these values, including index, get values derived from
size, I believe the should all be unsigned. In that case, remove the
index >= 0 test below.
>
> - for (i = 0; i < bufmgr->num_buckets; i++) {
> - struct bo_cache_bucket *bucket = &bufmgr->cache_bucket[i];
> - if (bucket->size >= size) {
> - return bucket;
> - }
> + /* condition for size less than 4*4096 (16KB) page size */
^ ^ Remove one space here.
|
Capitalize and add a period at the end of the sentence.
> + if(size <= 4 * PAGE_SIZE) {
^ Add a space here.
> + index = DIV_ROUND_UP(size, PAGE_SIZE) - 1;;
Remove extra trailing semicolon. ^
> + } else {
> + /* Number of pages of page size */
> + pages = DIV_ROUND_UP(size, PAGE_SIZE);
> + pages_log2 = ilog2_round_up(pages) - 1;
Isn't this going to be the same result as _mesa_logbase2(pages)?
> +
> + /* Finding the row and column of the matrix */
> + row = pages_log2 - 1;
> + col = DIV_ROUND_UP((pages - (1 << pages_log2)),
> + (1 << (pages_log2 - 2)));
^ ^
This should |
be aligned |
here. -------+
Removing some of the extra parenthesis would also make it easier to read.
So... I feel like this function doesn't need to special case size < 16k
like this. I haven't benchmarked it, but the following is about 30
bytes shorter and avoids the conditional branch. I also think it's
easier to understand.
const unsigned pages = DIV_ROUND_UP(size, PAGE_SIZE);
/* Row Bucket sizes clz((x-1) | 3) Row Column
* in pages stride size
* 0: 1 2 3 4 -> 30 30 30 30 4 1
* 1: 5 6 7 8 -> 29 29 29 29 4 1
* 2: 10 12 14 16 -> 28 28 28 28 8 2
* 3: 20 24 28 32 -> 27 27 27 27 16 4
*/
const unsigned row = 30 - __builtin_clz((pages - 1) | 3);
const unsigned row_max_pages = 4 << row;
/* The '& ~2' is the special case for row 1. In row 1, max pages /
* 2 is 2, but the previous row maximum is zero (because there is
* no previous row). All row maximum sizes are power of 2, so that
* is the only case where that bit will be set.
*/
const unsigned prev_row_max_pages = (row_max_pages / 2) & ~2;
const unsigned col_size = (row_max_pages - prev_row_max_pages) / 4;
const unsigned col = DIV_ROUND_UP(pages - prev_row_max_pages,
col_size);
/* Using the calculated row and column to index into the matrix */
const unsigned index = (row * 4) + (col - 1);
The second DIV_ROUND_UP annoys me. col_size is a power of two, so we
should be able to just shift. My attempts to take advantage of that
yielded piles of extra instructions and a conditional branch. The
problem was the row == 0 case. The best I got was:
int col_size_log2 = row - 1;
col_size_log2 += (col_size_log2 < 0);
const unsigned col = (pages - prev_row_max_pages +
((1 << col_size_log2) - 1)) >> col_size_log2;
I also found open-coding the first DIV_ROUND_UP as (size + PAGE_SIZE -
1) / PAGE_SIZE generated very slightly better code. That saves about 8
bytes of code... no idea if it's faster or not, but it did avoid a
dependency on the flags.
You should try my different versions of this routine in your benchmark.
> +
> + /* Using the calculated row and column to index into the matrix */
> + index = (row << 2) + (col - 1);
> }
>
> - return NULL;
> + /* Checking the error condition */
Assuming the index >= 0 check is removed, per my suggestion above, I
don't feel like this is a very useful comment. If index is >=
num_buckets, then the size is too large? The cache bucket hasn't been
created yet?
> + return (index >= 0 && index < bufmgr->num_buckets) ?
> + (&bufmgr->cache_bucket[index]) : NULL;
^ ^
Remove these extraneous parenthesis.
> }
>
> int
> @@ -1239,6 +1267,10 @@ add_bucket(struct brw_bufmgr *bufmgr, int size)
> list_inithead(&bufmgr->cache_bucket[i].head);
> bufmgr->cache_bucket[i].size = size;
> bufmgr->num_buckets++;
> +
> + assert(bucket_for_size(bufmgr, size) == &bufmgr->cache_bucket[i]);
> + assert(bucket_for_size(bufmgr, size - 2048) == &bufmgr->cache_bucket[i]);
> + assert(bucket_for_size(bufmgr, size + 1) != &bufmgr->cache_bucket[i]);
> }
>
> static void
>
More information about the mesa-dev
mailing list