[Mesa-dev] [PATCH] i965: extend fast texture upload

Wed Sep 25 10:18:07 PDT 2013

On 07/22/2013 03:54 PM, Frank Henigman wrote:> Extend the fast texture upload from BGRA X-tiled to include RGBA,
 > Alpha/Luminance, and Y-tiled.  Speed improvements, measured with
 > mesa demos teximage program, on 256 x 256 texture, in MB/s, on a
 > Sandy Bridge (Ivy is comparable):
 >
 >                before  after   increase
 > BGRA/X-tiled   3266    4524    1.39x
 > BGRA/Y-tiled   1739    3971    2.28x
 > RGBA/X-tiled    474    4694    9.90x
 > RGBA/Y-tiled    477    3368    7.06x
 >     L/X-tiled   1268    1516    1.20x
 >     L/Y-tiled   1439    1581    1.10x

These improvements look great. 9x is magic.

This patch was the most difficult patch I've ever reviewed. Not because
it was a bad patch; it's good. But because what it accomplishes is
complex and accordingly demanded a lot from the reader.

I reviewed the patch for correctness and readability, and it looks good on both
accounts.  The changes that I think are needed are mostly cosmetic.

1. Multi-line non-Doxygen comments need to be changed to Intel driver style.
    That is, comments that look like this

     /*
      * Stuff.
      * More stuff.
      */

    need to be changed to look like this

      /* Stuff.
       * More stuff.
       */

     Personally, I think the correct comment style is ugly. But, that's the
     style the driver uses.

2. Comments attached to functions should be Doxygen comments. This helps IDEs
    and anyone who generates Doxygen.

    That is,

      /* I'm a function that does
       * lots of stuff.
       */
       void f()

     should be replaced with

      /**
       * I'm a function that does
       * lots of stuff.
       */
       void f()

Also, I have other suggestions inline below.

 > Above numbers are with -mssse3, which is only used for RGBA color
 > swapping.  Even without it RGBA is considerably improved.
 > Improvements are greater for 1024 x 1024 textures.  No piglit regression.
 > Non-swizzling case not fully tested, just checked that for X-tiled the
 > new code does the same copying as the old code.

Me and Paul Berry verified no Piglit regressions on an Ivybridge with
swizzling disabled.

Did you verify the swizzling case on Sandybridge or Ivybridge?

 >
 > Signed-off-by: Frank Henigman <fjhenigman at google.com>
 > ---
 >   src/mesa/drivers/dri/i965/intel_tex_subimage.c | 439 +++++++++++++++++++++----
 >   1 file changed, 370 insertions(+), 69 deletions(-)
 >
 > diff --git a/src/mesa/drivers/dri/i965/intel_tex_subimage.c b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
 > index 05e684c..6df7de1 100644
 > --- a/src/mesa/drivers/dri/i965/intel_tex_subimage.c
 > +++ b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
 > @@ -43,6 +43,30 @@
 >
 >   #define FILE_DEBUG_FLAG DEBUG_TEXTURE
 >
 > +#define ALIGN_DOWN(a, b) ( (a) / (b) * (b) )
 > +#define ALIGN_UP(a, b) ALIGN_DOWN( (a)+(b)-1, b )

There already exists a macro in Mesa to align-up, named ALIGN. Let's eliminate
duplicate definitions by aliasing ALIGN_UP to ALIGN.

#define ALIGN_UP(a, b) ALIGN(a, b)

 > +
 > +/*
 > + * Tile dimensions.
 > + * 'span' is the most number of bytes we can copy from a linear source
 > + * without needing to calculate a new address.
 > + */
 > +static const uint32_t xtile_width = 512;
 > +static const uint32_t xtile_height = 8;
 > +static const uint32_t xtile_span = 64;
 > +static const uint32_t ytile_width = 128;
 > +static const uint32_t ytile_height = 32;
 > +static const uint32_t ytile_span = 16;
 > +
 > +typedef void *(*pixel_copy_fn)(void *dest, const void *src, size_t n);
 > +
 > +typedef void (*tile_copy_fn)(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
 > +                             uint32_t y0, uint32_t y1,
 > +                             char *dst, const char *src,
 > +                             uint32_t src_pitch,
 > +                             uint32_t swizzle_bit,
 > +                             pixel_copy_fn pixel_copy);
 > +

The name tile_copy_fn makes sense, because it's used to copy one tile at
a time.  Analogously, I expected pixel_copy_fn to be used to copy one
pixel at a time. But it isn't...  it's used to copy one *span* at
a time.  The mismatch between these two functions names confused me.

Please rename pixel_copy_fn to better reflect its use by matching the parity of
tile_copy_fn. I like span_copy_fn, if you think that's appropriate.

 >   static bool
 >   intel_blit_texsubimage(struct gl_context * ctx,
 >   		       struct gl_texture_image *texImage,
 > @@ -132,26 +156,332 @@ err:
 >      return false;
 >   }
 >
 > +#ifdef __SSSE3__
 > +static const uint8_t rgba8_permutation[16] =
 > +   { 2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15 };
 > +
 > +typedef char v16 __attribute__((vector_size(16)));
 > +
 > +/* NOTE: dst must be 16 byte aligned */
 > +#define rgba8_copy_16(dst, src)                     \
 > +   *(v16*)(dst) = __builtin_ia32_pshufb128(         \
 > +       (v16) __builtin_ia32_loadups((float*)(src)), \
 > +      *(v16*) rgba8_permutation                     \
 > +   )
 > +#endif
 > +
 > +/*
 > + * Copy RGBA to BGRA - swap R and B.
 > + */
 > +static inline void *
 > +rgba8_copy(void *dst, const void *src, size_t bytes)
 > +{
 > +   uint8_t *d = dst;
 > +   uint8_t const *s = src;
 > +
 > +#ifdef __SSSE3__
 > +   /*
 > +    * Fast copying for tile spans.
 > +    *
 > +    * As long as the destination texture is 16 aligned,
 > +    * any 16 or 64 spans we get here should also be 16 aligned.
 > +    */
 > +
 > +   if (bytes == 16) {
 > +      assert(!(((uintptr_t)dst) & 0xf));
 > +      rgba8_copy_16(d+ 0, s+ 0);
 > +      return dst;
 > +   }
 > +
 > +   if (bytes == 64) {
 > +      assert(!(((uintptr_t)dst) & 0xf));
 > +      rgba8_copy_16(d+ 0, s+ 0);
 > +      rgba8_copy_16(d+16, s+16);
 > +      rgba8_copy_16(d+32, s+32);
 > +      rgba8_copy_16(d+48, s+48);
 > +      return dst;
 > +   }
 > +#endif
 > +
 > +   while (bytes >= 4) {
 > +      d[0] = s[2];
 > +      d[1] = s[1];
 > +      d[2] = s[0];
 > +      d[3] = s[3];
 > +      d += 4;
 > +      s += 4;
 > +      bytes -= 4;
 > +   }
 > +   return dst;
 > +}
 > +
 > +/*
 > + * The tile copy functions copy to one tile from a linear source.
 > + *
 > + * Each row is copied in three parts: [x0,x1), [x1,x2), [x2,x3).
 > + * These ranges are in bytes, i.e. pixels * bytes-per-pixel.
 > + * The first and last are less than a span and the middle is a whole
 > + * number of spans.  The destination of each span in the tile is the
 > + * sum of an x offset 'xo' and a y offset 'yo' which is optionally
 > + * xor-ed with a 'swizzle' bit.
 > + * The x and y ranges to be copied must not cross any tile boundaries.
 > + * 'dst' is the start of the tile and 'src' is the corresponding
 > + * address to copy from, though copying begins at (x0, y0).
 > + */
 > +

This comment needs several fixes.

1. There should be empty line between comment and function.

2. Make it a Doxygen comment by changing its format to

    /**
     * I'm a function
     * that does stuff.
     */
     static inline void
     xtile_copy(...)

3. To aid people who are trying to understand this difficult code, move
    the comment to the top-level tile_copy_fn. Then, at each tile copy
    function, copy the tile_copy_fn documentation with the Doxygen
    \copydoc command and add any function-specific comments. See below
    for an example. Otherwise, in debugging someone may arrive at
    ytile_copy and never find the comments because they don't know to
    look at xtile_copy.

    /**
     * The tile copy functions do stuff.
     *
     * They do stuff with foo and bar.
     */
    typedef void (*tile_copy_fn)(...);

    ...

    /**
     * \copydoc tile_copy_fn
     *
     * Here insert any comments specific to xtile_copy, if any.
     */
    static inline void
    xtile_copy(...)

    ...

    /**
     * \copydoc tile_copy_fn
     *
     * Here insert any comments specific to ytile_copy, if any.
     */
    static inline void
    ytile_copy(...)

    ...

    Again for xtile_copy_faster() and ytile_copy_faster().

4. "Each row is copied in three parts: [x0,x1), [x1,x2), [x2,x3).
    These ranges are in bytes, i.e. pixels * bytes-per-pixel.
    The first and last are less than a span and the middle is a whole
    number of spans."

    This needs some clarifications. I suggest the following, but please modify
    it as you see best. Also, I changed some phrasing from "description" to
    "requirement".

    "Each row is copied in three parts: [x0,x1), [x1,x2), [x2,x3).  These
    ranges are in bytes, i.e. pixels * bytes-per-pixel.  The length of of
    the first and last intervals must be less than a span, and may be
    zero.  The length of the middle must be a whole (possibly zero)
    number of spans.

5. "sum of an x offset 'xo' and a y offset 'yo' which is optionally
    xor-ed with a 'swizzle' bit."

    This sub-comment describes one of the function's  implementation
    details.  It's presence doesn't help explain the function's contract;
    instead, its presence dilutes and confuses the main-comment. Since
    it's an implementation detail, move this sub-comment into the
    function body.

6. One last thing. It would be nice if the comment explicitly stated
    that the function "copies a region whose size is no larger than
    a tile". One could infer this from the comment, but the clarification
    would help.

 > +static inline void
 > +xtile_copy(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
 > +           uint32_t y0, uint32_t y1,
 > +           char *dst, const char *src,
 > +           uint32_t src_pitch,
 > +           uint32_t swizzle_bit,
 > +           pixel_copy_fn pixel_copy)
 > +{
 > +   uint32_t xo, yo;
 > +
 > +   src += y0 * src_pitch;
 > +
 > +   for (yo = y0 * xtile_width; yo < y1 * xtile_width; yo += xtile_width) {
 > +      /*
 > +       * Bits 9 and 10 control swizzling which happens in bit 6.
 > +       * Move bits 9 and 10 three and four places respectively down
 > +       * to bit 6 and xor them.
 > +       * swizzle_bit is 1<<6 if swizzling, else 0 if not swizzling.
 > +       */

Again, Intel style for multi-line comments is:

          /* Bits 9 and 10 control swizzling which happens in bit 6.
           * Move bits 9 and 10 three and four places respectively down
           * to bit 6 and xor them.
           * swizzle_bit is 1<<6 if swizzling, else 0 if not swizzling.
           */
 > +      uint32_t swizzle = ((yo >> 3) ^ (yo >> 4)) & swizzle_bit;
 > +
 > +      pixel_copy(dst + ((x0 + yo) ^ swizzle), src + x0, x1 - x0);
 > +
 > +      for (xo = x1; xo < x2; xo += xtile_span) {
 > +         pixel_copy(dst + ((xo + yo) ^ swizzle), src + xo, xtile_span);
 > +      }
 > +
 > +      pixel_copy(dst + ((xo + yo) ^ swizzle), src + x2, x3 - x2);
 > +
 > +      src += src_pitch;
 > +   }
 > +}
 > +
 > +static inline void
 > +ytile_copy(
 > +   uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
 > +   uint32_t y0, uint32_t y1,
 > +   char *dst, const char *src,
 > +   uint32_t src_pitch,
 > +   uint32_t swizzle_bit,
 > +   pixel_copy_fn pixel_copy)
 > +{
 > +   const uint32_t column_width = ytile_span;
 > +   const uint32_t column_bytes = column_width * ytile_height;
 > +   /*
 > +    * Y tiles consist of 16-byte wide columns, thus as we loop over rows
 > +    * 'yo' increments by that amount, while 'xo' is the low four bits of the
 > +    * x coordinate plus the high bits multiplied by the number of bytes
 > +    * in a column.
 > +    */
 > +   uint32_t xo0 = x0 / column_width * column_bytes + x0 % column_width;
 > +   uint32_t xo1 = x1 / column_width * column_bytes + x1 % column_width;
 > +   /*
 > +    * Bit 9 controls swizzling, which happens in bit 6.
 > +    * Move bit 9 three places down to bit 6.
 > +    * swizzle_bit is 1<<6 if swizzling, else 0 if not swizzling.
 > +    */
 > +   uint32_t swizzle0 = (xo0 >> 3) & swizzle_bit;
 > +   uint32_t swizzle1 = (xo1 >> 3) & swizzle_bit;
 > +
 > +   uint32_t x, yo;
 > +
 > +   src += y0 * src_pitch;
 > +
 > +   for (yo = y0 * column_width; yo < y1 * column_width; yo += column_width) {
 > +      uint32_t xo = xo1;
 > +      uint32_t swizzle = swizzle1;
 > +
 > +      pixel_copy(dst + ((xo0 + yo) ^ swizzle0), src + x0, x1 - x0);
 > +
 > +      /* Step by spans/columns.  As it happens, the swizzle bit flips
 > +       * at each step so we don't need to calculate it explicitly. */
 > +      for (x = x1; x < x2; x += ytile_span) {
 > +         pixel_copy(dst + ((xo + yo) ^ swizzle), src + x, ytile_span);
 > +         xo += column_bytes;
 > +         swizzle ^= swizzle_bit;
 > +      }
 > +
 > +      pixel_copy(dst + ((xo + yo) ^ swizzle), src + x2, x3 - x2);
 > +
 > +      src += src_pitch;
 > +   }
 > +}
 > +
 > +/*
 > + * All these faster functions do is call a tile copy function with
 > + * constant parameters.  This can speed things up quite a bit as gcc
 > + * will inline different code optimized for each set of parameters.
 > + */
 > +
 > +static void
 > +xtile_copy_faster(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,

Again, (1) remove the empty line between comment function, (2) make it
a Doxygen comment, and (3) use \copydoc xtile_copy_fn.

/**
  * \copydoc tile_copy_fn
  *
  * This is a faster variant of xtile_copy(). All it does is call
  * xtile_copy() with constant parameters when possible.  This can speed
  * things up quite a bit as gcc will inline different code optimized for
  * each set of parameters.
  */
static void
xtile_copy_faster(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,

By the way, I read this patch from top-to-bottom, because that's how the
function dependency chain flows. When I understood the connection
between linear_tile_copy and xtile_copy, I said to myself: "But the use
of function pointers removes the potential for the compiler to optimize
the most common cases :(" And then... I found xtile_copy_faster(), and
was relieved. A suspenseful patch indeed.

 > +                  uint32_t y0, uint32_t y1,
 > +                  char *dst, const char *src,
 > +                  uint32_t src_pitch,
 > +                  uint32_t swizzle_bit,
 > +                  pixel_copy_fn pixel_copy)
 > +{
 > +   if (x0 == 0 && x3 == xtile_width && y0 == 0 && y1 == xtile_height) {
 > +      if (pixel_copy == memcpy)
 > +         return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
 > +                           dst, src, src_pitch, swizzle_bit, memcpy);
 > +      else if (pixel_copy == rgba8_copy)
 > +         return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
 > +                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
 > +   } else {
 > +      if (pixel_copy == memcpy)
 > +         return xtile_copy(x0, x1, x2, x3, y0, y1,
 > +                           dst, src, src_pitch, swizzle_bit, memcpy);
 > +      else if (pixel_copy == rgba8_copy)
 > +         return xtile_copy(x0, x1, x2, x3, y0, y1,
 > +                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
 > +   }
 > +   xtile_copy(x0, x1, x2, x3, y0, y1,
 > +              dst, src, src_pitch, swizzle_bit, pixel_copy);
 > +}
 > +

/**
  * \copydoc tile_copy_fn
  *
  * This is a faster variant of ytile_copy(). All it does is call
  * ytile_copy() with constant parameters when possible.  This can speed
  * things up quite a bit as gcc will inline different code optimized for
  * each set of parameters.
  */
 > +static void
 > +ytile_copy_faster(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
 > +                  uint32_t y0, uint32_t y1,
 > +                  char *dst, const char *src,
 > +                  uint32_t src_pitch,
 > +                  uint32_t swizzle_bit,
 > +                  pixel_copy_fn pixel_copy)
 > +{
 > +   if (x0 == 0 && x3 == ytile_width && y0 == 0 && y1 == ytile_height) {
 > +      if (pixel_copy == memcpy)
 > +         return ytile_copy(0, 0, ytile_width, ytile_width, 0, ytile_height,
 > +                           dst, src, src_pitch, swizzle_bit, memcpy);
 > +      else if (pixel_copy == rgba8_copy)
 > +         return ytile_copy(0, 0, ytile_width, ytile_width, 0, ytile_height,
 > +                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
 > +   } else {
 > +      if (pixel_copy == memcpy)
 > +         return ytile_copy(x0, x1, x2, x3, y0, y1,
 > +                           dst, src, src_pitch, swizzle_bit, memcpy);
 > +      else if (pixel_copy == rgba8_copy)
 > +         return ytile_copy(x0, x1, x2, x3, y0, y1,
 > +                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
 > +   }
 > +   ytile_copy(x0, x1, x2, x3, y0, y1,
 > +              dst, src, src_pitch, swizzle_bit, pixel_copy);
 > +}
 > +
 > +/*
 > + * Copy from linear to tiled texture.
 > + *
 > + * The x range [xt1, xt2) is in bytes, i.e. pixels * bytes-per-pixel.
 > + * The x range and y range [yt1, yt2) are divided into pieces that do
 > + * not cross tile boundaries and a single-tile copy function is called
 > + * to copy each piece.
 > + * 'dst' is the start of the texture and 'src' is the corresponding
 > + * address to copy from, though copying begins at (xt1, yt1).
 > + */

Again, Doxygenate this.

Also, I has to re-read this several times before I succeeded in
reverse-engineering the intent from the passive voice. Please use the
active voice when possible.

     This funciton divides the x range [xt, xt2) and y range [yt1, yt2)
     into pieces that do not cross tile boundaries and then copies each
     piece with a tile-copy function (\ref tile_copy_fn).

 > +static void
 > +linear_to_tiled(uint32_t xt1, uint32_t xt2,
 > +                uint32_t yt1, uint32_t yt2,
 > +                char *dst, const char *src,
 > +                uint32_t dst_pitch, uint32_t src_pitch,
 > +                bool has_swizzling,
 > +                uint32_t tiling,
 > +                pixel_copy_fn pixel_copy)
 > +{
 > +   tile_copy_fn tile_copy;
 > +   uint32_t xt0, xt3;
 > +   uint32_t yt0, yt3;
 > +   uint32_t xt, yt;
 > +   uint32_t tw, th, span;
 > +   uint32_t swizzle_bit = has_swizzling ? 1<<6 : 0;

The 't' suffix means something non-obvious, so please document it. If
I understand correctly, does the 't' mean the variable is tile-aligned?

 > +
 > +   if (tiling == I915_TILING_X) {
 > +      tw = xtile_width;
 > +      th = xtile_height;
 > +      span = xtile_span;
 > +      tile_copy = xtile_copy_faster;
 > +   } else if (tiling == I915_TILING_Y) {
 > +      tw = ytile_width;
 > +      th = ytile_height;
 > +      span = ytile_span;
 > +      tile_copy = ytile_copy_faster;
 > +   } else {
 > +      assert(!"unsupported tiling");
 > +      return;
 > +   }
 > +
 > +   /* Round up to tile boundaries. */
 > +   xt0 = ALIGN_DOWN(xt1, tw);
 > +   xt3 = ALIGN_UP  (xt2, tw);
 > +   yt0 = ALIGN_DOWN(yt1, th);
 > +   yt3 = ALIGN_UP  (yt2, th);
 > +

 > +   /*
 > +    * xt and yt loop over full tiles.  tile_copy() copies one tile at a time.
 > +    * Looping x inside y is the faster memory access pattern.
 > +    */

Again, convert the mulit-line comment to Intel style.

       /* xt and yt loop over full tiles.  tile_copy() copies one tile at a time.
        * Looping x inside y is the faster memory access pattern.
        */

And again below.

 > +   for (yt = yt0; yt < yt3; yt += th) {
 > +      for (xt = xt0; xt < xt3; xt += tw) {
 > +         /*
 > +          * The area to update is [x0,x3) x [y0,y1).
 > +          * May not want the whole tile, hence the min and max.
 > +          */
 > +         uint32_t x0 = MAX2(xt1, xt);
 > +         uint32_t y0 = MAX2(yt1, yt);
 > +         uint32_t x3 = MIN2(xt2, xt + tw);
 > +         uint32_t y1 = MIN2(yt2, yt + th);
 > +
 > +         /*
 > +          * [x0,x3) is split into [x0,x1), [x1,x2), [x2,x3) such that
 > +          * the middle interval is the longest span-aligned part.
 > +          */

Please add a clarifiction that states each interval may have zero
length.

 > +         uint32_t x1, x2;
 > +         x1 = ALIGN_UP(x0, span);
 > +         if (x1 > x3)
 > +            x1 = x2 = x3;
 > +         else
 > +            x2 = ALIGN_DOWN(x3, span);
 > +
 > +         assert(x0 <= x1 && x1 <= x2 && x2 <= x3);
 > +         assert(x1 - x0 < span && x3 - x2 < span);
 > +         assert(x3 - x0 <= tw);

One more assertion, please:

     assert((x2 - x1) % span == 0);
 > +
 > +         /* Translate by (xt,yt) for single-tile copier. */
 > +         tile_copy(x0-xt, x1-xt, x2-xt, x3-xt,
 > +                   y0-yt, y1-yt,
 > +                   dst + xt * th + yt * dst_pitch,
 > +                   src + xt      + yt * src_pitch,
 > +                   src_pitch,
 > +                   swizzle_bit,
 > +                   pixel_copy);
 > +      }
 > +   }
 > +}
 > +
 >   /**
 >    * \brief A fast path for glTexImage and glTexSubImage.
 >    *
 >    * \param for_glTexImage Was this called from glTexImage or glTexSubImage?
 >    *
 > - * This fast path is taken when the hardware natively supports the texture
 > - * format (such as GL_BGRA) and when the texture memory is X-tiled. It uploads
 > + * This fast path is taken when the texture format is BGRA, RGBA,
 > + * A or L and when the texture memory is X- or Y-tiled.  It uploads
 >    * the texture data by mapping the texture memory without a GTT fence, thus
 > - * acquiring a tiled view of the memory, and then memcpy'ing sucessive
 > - * subspans within each tile.
 > + * acquiring a tiled view of the memory, and then copying sucessive
 > + * spans within each tile.
 >    *
 >    * This is a performance win over the conventional texture upload path because
 >    * it avoids the performance penalty of writing through the write-combine
 >    * buffer. In the conventional texture upload path,
 >    * texstore.c:store_texsubimage(), the texture memory is mapped through a GTT
 >    * fence, thus acquiring a linear view of the memory, then each row in the
 > - * image is memcpy'd. In this fast path, we replace each row's memcpy with
 > - * a sequence of memcpy's over each bit6 swizzle span in the row.
 > + * image is memcpy'd. In this fast path, we replace each row's copy with
 > + * a sequence of copies over each linear span in tile.
 >    *
 > - * This fast path's use case is Google Chrome's paint rectangles.  Chrome (as
 > + * One use case is Google Chrome's paint rectangles.  Chrome (as
 >    * of version 21) renders each page as a tiling of 256x256 GL_BGRA textures.
 >    * Each page's content is initially uploaded with glTexImage2D and damaged
 >    * regions are updated with glTexSubImage2D. On some workloads, the
 > @@ -176,14 +506,15 @@ intel_texsubimage_tiled_memcpy(struct gl_context * ctx,
 >
 >      int error = 0;
 >
 > -   /* This fastpath is restricted to a specific texture type: level 0 of
 > -    * a 2D BGRA texture. It could be generalized to support more types by
 > -    * varying the arithmetic loop below.
 > +   uint32_t cpp;
 > +   pixel_copy_fn pixel_copy = NULL;
 > +
 > +   /* This fastpath is restricted to specific texture types: level 0 of
 > +    * a 2D BGRA, RGBA, L8 or A8 texture. It could be generalized to support
 > +    * more types.
 >       */
 >      if (!brw->has_llc ||
 > -       format != GL_BGRA ||
 >          type != GL_UNSIGNED_BYTE ||
 > -       texImage->TexFormat != MESA_FORMAT_ARGB8888 ||
 >          texImage->TexObject->Target != GL_TEXTURE_2D ||
 >          texImage->Level != 0 ||
 >          pixels == NULL ||
 > @@ -197,12 +528,28 @@ intel_texsubimage_tiled_memcpy(struct gl_context * ctx,
 >          packing->Invert)
 >         return false;
 >
 > +   if ((texImage->TexFormat == MESA_FORMAT_L8 && format == GL_LUMINANCE) ||
 > +       (texImage->TexFormat == MESA_FORMAT_A8 && format == GL_ALPHA)) {
 > +      cpp = 1;
 > +      pixel_copy = memcpy;
 > +   } else if (texImage->TexFormat == MESA_FORMAT_ARGB8888) {
 > +      cpp = 4;
 > +      if (format == GL_BGRA) {
 > +         pixel_copy = memcpy;
 > +      } else if (format == GL_RGBA) {
 > +         pixel_copy = rgba8_copy;
 > +      }
 > +   }

Need an empty line here to separate the first if-tree from the second.

 > +   if (!pixel_copy)
 > +      return false;
 > +
 >      if (for_glTexImage)
 >         ctx->Driver.AllocTextureImageBuffer(ctx, texImage);
 >
 >      if (!image->mt ||
 > -       image->mt->region->tiling != I915_TILING_X) {
 > -      /* The algorithm below is written only for X-tiled memory. */
 > +       (image->mt->region->tiling != I915_TILING_X &&
 > +       image->mt->region->tiling != I915_TILING_Y)) {
 > +      /* The algorithm is written only for X- or Y-tiled memory. */
 >         return false;
 >      }
 >
 > @@ -236,61 +583,15 @@ intel_texsubimage_tiled_memcpy(struct gl_context * ctx,
 >      DBG("%s: level=%d offset=(%d,%d) (w,h)=(%d,%d)\n",
 >          __FUNCTION__, texImage->Level, xoffset, yoffset, width, height);
 >
 > -   /* In the tiling algorithm below, some variables are in units of pixels,
 > -    * others are in units of bytes, and others (such as height) are unitless.
 > -    * Each variable name is suffixed with its units.
 > -    */
 > -
 > -   const uint32_t x_max_pixels = xoffset + width;
 > -   const uint32_t y_max_pixels = yoffset + height;
 > -
 > -   const uint32_t tile_size_bytes = 4096;
 > -
 > -   const uint32_t tile_width_bytes = 512;
 > -   const uint32_t tile_width_pixels = 128;
 > -
 > -   const uint32_t tile_height = 8;
 > -
 > -   const uint32_t cpp = 4; /* chars per pixel of GL_BGRA */
 > -   const uint32_t swizzle_width_pixels = 16;
 > -
 > -   const uint32_t stride_bytes = image->mt->region->pitch;
 > -   const uint32_t width_tiles = stride_bytes / tile_width_bytes;
 > -
 > -   for (uint32_t y_pixels = yoffset; y_pixels < y_max_pixels; ++y_pixels) {
 > -      const uint32_t y_offset_bytes = (y_pixels / tile_height) * width_tiles * tile_size_bytes
 > -                                    + (y_pixels % tile_height) * tile_width_bytes;
 > -
 > -      for (uint32_t x_pixels = xoffset; x_pixels < x_max_pixels; x_pixels += swizzle_width_pixels) {
 > -         const uint32_t x_offset_bytes = (x_pixels / tile_width_pixels) * tile_size_bytes
 > -                                       + (x_pixels % tile_width_pixels) * cpp;
 > -
 > -         intptr_t offset_bytes = y_offset_bytes + x_offset_bytes;
 > -         if (brw->has_swizzling) {
 > -#if 0
 > -            /* Clear, unoptimized version. */
 > -            bool bit6 = (offset_bytes >> 6) & 1;
 > -            bool bit9 = (offset_bytes >> 9) & 1;
 > -            bool bit10 = (offset_bytes >> 10) & 1;
 > -
 > -            if (bit9 ^ bit10)
 > -               offset_bytes ^= (1 << 6);
 > -#else
 > -            /* Optimized, obfuscated version. */
 > -            offset_bytes ^= ((offset_bytes >> 3) ^ (offset_bytes >> 4))
 > -                          & (1 << 6);
 > -#endif
 > -         }
 > -
 > -         const uint32_t swizzle_bound_pixels = ALIGN(x_pixels + 1, swizzle_width_pixels);
 > -         const uint32_t memcpy_bound_pixels = MIN2(x_max_pixels, swizzle_bound_pixels);
 > -         const uint32_t copy_size = cpp * (memcpy_bound_pixels - x_pixels);
 > -
 > -         memcpy(bo->virtual + offset_bytes, pixels, copy_size);
 > -         pixels += copy_size;
 > -         x_pixels -= (x_pixels % swizzle_width_pixels);
 > -      }
 > -   }
 > +   linear_to_tiled(
 > +      xoffset * cpp, (xoffset + width) * cpp,
 > +      yoffset, yoffset + height,
 > +      bo->virtual, pixels - (xoffset + yoffset * width) * cpp,
 > +      image->mt->region->pitch, width * cpp,
 > +      brw->has_swizzling,
 > +      image->mt->region->tiling,
 > +      pixel_copy
 > +   );
 >
 >      drm_intel_bo_unmap(bo);
 >      return true;
 >