[Mesa-dev] [PATCH 18/25] i965: Import surface lowering code.

Fri Jan 3 11:26:55 PST 2014

On 2 December 2013 11:39, Francisco Jerez <currojerez at riseup.net> wrote:

> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_surface_visitor.cpp
> b/src/mesa/drivers/dri/i965/brw_vec4_surface_visitor.cpp
> new file mode 100644
> index 0000000..3528bbe
> --- /dev/null
> +++ b/src/mesa/drivers/dri/i965/brw_vec4_surface_visitor.cpp
> @@ -0,0 +1,846 @@
> +/*
> + * Copyright 2013 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the
> "Software"),
> + * to deal in the Software without restriction, including without
> limitation
> + * the rights to use, copy, modify, merge, publish, distribute,
> sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the
> next
> + * paragraph) shall be included in all copies or substantial portions of
> the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT
> SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> DEALINGS
> + * IN THE SOFTWARE.
> + *
> + * Authors:
> + *    Francisco Jerez <currojerez at riseup.net>
> + */
> +
> +#include "brw_vec4_surface_visitor.h"
> +
> +using namespace brw;
> +
> +namespace {
> +   vec4_instruction &
> +   exec_all(vec4_instruction &inst)
> +   {
> +      inst.force_writemask_all = true;
> +      return inst;
> +   }
> +
> +   vec4_instruction &
> +   exec_predicated(backend_reg flag, vec4_instruction &inst)
> +   {
> +      if (flag.file != BAD_FILE)
> +         inst.predicate = BRW_PREDICATE_ALIGN16_ALL4H;
>

There should be a comment here explaining why you chose this predicate (or
at the very least referring the reader to emit_coordinate_check() for
context).

> +
> +      return inst;
> +   }
> +}
> +
> +
> +/**
> + * Copy a SIMD4x2 vector to its transpose SIMD8x4 vector.
>

Usually SIMDNxM means "M separate execution flows are simultaneously
processed, with each execution flow processing a packed vector of N
elements at a time".  (Note that SIMD8 is considered an abbreviation for
SIMD1x8 in this interpretation).  So SIMD8x4 doesn't make any sense.

How about instead saying "Copy a SIMD4x2 vector to execution channels 0 and
4 of a SIMD8 vector"?

Also, there should be a comment clarifying why it's ok to leave channels
1-3 and 5-7 uninitialized.

> + */
> +void
> +brw_vec4_surface_visitor::emit_assign_to_transpose(
> +   dst_reg dst, src_reg src, unsigned size) const
> +{
> +   for (unsigned i = 0; i < size; ++i) {
> +      emit(BRW_OPCODE_MOV,
> +           writemask(offset(dst, i), WRITEMASK_X),
> +           swizzle(src, BRW_SWIZZLE4(i, i, i, i)));
> +   }
>
+}
> +
> +/**
> + * Copy a SIMD4x2 vector from its transpose SIMD8x4 vector.
>

Similar clarification is needed here.

> + */
> +void
> +brw_vec4_surface_visitor::emit_assign_from_transpose(
> +   dst_reg dst, src_reg src, unsigned size) const
> +{
> +   for (unsigned i = 0; i < size; ++i) {
> +      emit(BRW_OPCODE_MOV,
> +           writemask(dst, 1 << i),
> +           swizzle(offset(src, i), BRW_SWIZZLE_XXXX));
> +   }
> +}
> +
> +/**
> + * Initialize the header present in some surface access messages.
> + */
> +void
> +brw_vec4_surface_visitor::emit_surface_header(struct dst_reg dst) const
> +{
> +   assert(dst.file == MRF);
> +   exec_all(emit(BRW_OPCODE_MOV, dst, 0));
> +
> +   if (!v->brw->is_haswell) {
> +      /* The sample mask is used on IVB for the SIMD8 messages that
> +       * have no SIMD4x2 counterpart.  We only use the two X channels
> +       * in that case, mask everything else out.
> +       */
> +      exec_all(emit(BRW_OPCODE_MOV,
> +                    brw_writemask(brw_uvec_mrf(4, dst.reg, 4),
> WRITEMASK_W),
> +                    0x11));
> +   }
> +}
>

As with the corresponding fs function, it would be helpful for the comment
to say which messages this function is expected to be used for, or to
include a bspec reference that shows the header format for the messages you
care about.

> +backend_reg
> +brw_vec4_surface_visitor::emit_coordinate_address_calculation(
> +   backend_reg image, backend_reg addr, unsigned dims) const
>

Most of my comments on
brw_fs_surface_visitor::emit_coordinate_address_calculation() apply here as
well.

> +{
> +   const unsigned mask = (1 << dims) - 1;
> +   src_reg off = offset(image, BRW_IMAGE_PARAM_OFFSET_OFFSET / 4);
> +   src_reg stride = offset(image, BRW_IMAGE_PARAM_STRIDE_OFFSET / 4);
> +   src_reg tile = offset(image, BRW_IMAGE_PARAM_TILING_OFFSET / 4);
> +   src_reg swz = offset(image, BRW_IMAGE_PARAM_SWIZZLING_OFFSET / 4);
> +   src_reg dst = make_grf(BRW_REGISTER_TYPE_UD, 1);
> +   src_reg tmp = make_grf(BRW_REGISTER_TYPE_UD, 4);
> +
> +   /* Shift the coordinates by the fixed surface offset. */
> +   emit(BRW_OPCODE_ADD, writemask(addr, WRITEMASK_XY & mask),
> +        addr, off);
> +
> +   if (dims > 2) {
> +      /* Decompose z into a major (tmp.w) and a minor (tmp.z)
> +       * index.
> +       */
> +      emit(BRW_OPCODE_SHL, writemask(tmp, WRITEMASK_Z),
> +           addr, negate(tile));
> +
> +      emit(BRW_OPCODE_SHR, writemask(tmp, WRITEMASK_Z),
> +           tmp, negate(tile));
>

The use of negate() with a shift is sufficiently clever that it could use a
comment explaining what's going on.  In partuclar, that the SHL and SHR are
effectively preserving the lowest-order tile.z bits of addr.z and
discarding the rest, except in the case where tile.z == 0, in which case
they are keeping all the bits.

> +
> +      emit(BRW_OPCODE_SHR, writemask(tmp, WRITEMASK_W),
> +           swizzle(addr, BRW_SWIZZLE_ZZZZ),
> +           swizzle(tile, BRW_SWIZZLE_ZZZZ));
> +
> +      /* Calculate the horizontal (tmp.z) and vertical (tmp.w) slice
> +       * offset.
> +       */
> +      emit(BRW_OPCODE_MUL, writemask(tmp, WRITEMASK_ZW),
> +           stride, tmp);
> +      emit(BRW_OPCODE_ADD, writemask(addr, WRITEMASK_XY),
> +           addr, swizzle(tmp, BRW_SWIZZLE_ZWZW));
>

There should be a comment clarifying that in the case where tile.z == 0,
since tmp.z == addr.z, we are relying on the fact that stride.z is 0 in
order to avoid corrupting the contents of addr.x.  Additionally, in patch
04/25, a comment should be added to
brw_miptree_get_horizontal_slice_pitch() to clarify that in the
non-GL_TEXTURE_3D case,
brw_vec4_surface_visitor::emit_coordinate_address_calculation() relies on
the return value being 0.

> +   }
> +
> +   if (dims > 1) {
> +      /* Calculate the minor x (tmp.x) and y (tmp.y) indices. */
> +      emit(BRW_OPCODE_SHL, writemask(tmp, WRITEMASK_XY),
> +           addr, negate(tile));
> +
> +      emit(BRW_OPCODE_SHR, writemask(tmp, WRITEMASK_XY),
> +           tmp, negate(tile));
> +
> +      /* Calculate the major x (tmp.z) and y (tmp.w) indices. */
> +      emit(BRW_OPCODE_SHR, writemask(tmp, WRITEMASK_ZW),
> +           swizzle(addr, BRW_SWIZZLE_XYXY),
> +           swizzle(tile, BRW_SWIZZLE_XYXY));
> +
> +      /* Multiply the minor indices and the major x index (tmp.x,
> +       * tmp.y and tmp.w) by the Bpp, and the major y index (tmp.w) by
> +       * the vertical stride.
> +       */
> +      emit(BRW_OPCODE_MUL, writemask(tmp, WRITEMASK_XYZW),
> +           swizzle(stride, BRW_SWIZZLE_XXXY), tmp);
> +
> +      /* Multiply by the tile dimensions using two shift instructions.
> +       * Equivalent to:
> +       *   minor.y = minor.y << tile.x
> +       *   major.x = major.x << tile.x << tile.y
> +       *   major.y = major.y << tile.y
> +       */
> +      emit(BRW_OPCODE_SHL, writemask(tmp, WRITEMASK_ZW),
> +           swizzle(tmp, BRW_SWIZZLE_ZWZW),
> +           swizzle(tile, BRW_SWIZZLE_YYYY));
> +
> +      emit(BRW_OPCODE_SHL, writemask(tmp, WRITEMASK_YZ),
> +           swizzle(tmp, BRW_SWIZZLE_YYZZ),
> +           swizzle(tile, BRW_SWIZZLE_XXXX));
> +
> +      /* Add everything up. */
> +      emit(BRW_OPCODE_ADD, writemask(tmp, WRITEMASK_XY),
> +           swizzle(tmp, BRW_SWIZZLE_XYXY),
> +           swizzle(tmp, BRW_SWIZZLE_ZWZW));
> +
> +      emit(BRW_OPCODE_ADD, writemask(dst, WRITEMASK_X),
> +           swizzle(tmp, BRW_SWIZZLE_XXXX),
> +           swizzle(tmp, BRW_SWIZZLE_YYYY));
>

I believe this code calculates the wrong value when tile.x == tile.y == 0,
because all the shifts shift by zero, so just before the line "/* Add
everything up. */", tmp has the values:

tmp.x = addr.x * stride.x
tmp.y = addr.y * stride.x
tmp.z = addr.x * stride.x
tmp.w = addr.y * stride.y

And then when you add it all up you get

dst.x = addr.x * (2 * stride.x) + addr.y * (stride.x + stride.y)

Which isn't correct.  What you want is:

dst.x = addr.x * stride.x + addr.y * stride.y

> +
> +      if (v->brw->has_swizzling) {
> +         /* Take into account the two dynamically specified shifts. */
> +         emit(BRW_OPCODE_SHR, writemask(tmp, WRITEMASK_XY),
> +              swizzle(dst, BRW_SWIZZLE_XXXX), swz);
> +
> +         /* XOR tmp.x and tmp.y with bit 6 of the memory address. */
> +         emit(BRW_OPCODE_XOR, writemask(tmp, WRITEMASK_X),
> +              swizzle(tmp, BRW_SWIZZLE_XXXX),
> +              swizzle(tmp, BRW_SWIZZLE_YYYY));
> +
> +         emit(BRW_OPCODE_AND, writemask(tmp, WRITEMASK_X),
> +              tmp, 1 << 6);
> +
> +         emit(BRW_OPCODE_XOR, writemask(dst, WRITEMASK_X),
> +              dst, tmp);
> +      }
> +
> +   } else {
> +      /* Multiply by the Bpp value. */
> +      emit(BRW_OPCODE_MUL, writemask(dst, WRITEMASK_X),
> +           addr, stride);
> +   }
> +
> +   return dst;
> +}
> +
> +backend_reg
> +brw_vec4_surface_visitor::emit_untyped_read(
> +   backend_reg flag, backend_reg surface, backend_reg addr,
> +   unsigned dims, unsigned size) const
>

It would be nice to have a comment in this function saying that for both
IVB and HSW, we use SIMD4x2 messages for untyped surface read.

> +
> +void
> +brw_vec4_surface_visitor::emit_untyped_write(
> +   backend_reg flag, backend_reg surface, backend_reg addr,
> +   backend_reg src, unsigned dims, unsigned size) const
> +{
> +   const unsigned mask = (v->brw->is_haswell ? (1 << size) - 1 : 1);
> +   unsigned mlen = 0;
>

It would be nice to move the comment about IVB using a SIMD8 message up
here, since without that context, the code to emit the surface write
address and the source value don't make sense.  A similar comment applies
to emit_untyped_atomic(), emit_typed_read(), emit_typed_write(), and
emit_typed_atomic().

> +
> +   /* Set the surface write address. */
> +   if (v->brw->is_haswell) {
> +      emit_assign_with_pad(make_mrf(mlen), addr, dims);
> +      mlen++;
> +   } else {
> +      emit_assign_to_transpose(make_mrf(mlen), addr, dims);
> +      mlen += dims;
> +   }
> +
> +   /* Set the source value. */
> +   if (v->brw->is_haswell) {
> +      emit_assign_with_pad(make_mrf(mlen), src, size);
> +      mlen++;
> +   } else {
> +      emit_assign_to_transpose(make_mrf(mlen), src, size);
> +      mlen += size;
> +   }
> +
> +   /* Emit the instruction.  Note that this is translated into the
> +    * SIMD8 untyped surface write message on IVB because the
> +    * hardware lacks a SIMD4x2 counterpart.
> +    */
> +   vec4_instruction &inst = exec_predicated(
> +      flag, emit(SHADER_OPCODE_UNTYPED_SURFACE_WRITE,
> +                 brw_writemask(brw_null_reg(), mask),
> +                 surface, size));
> +   inst.base_mrf = 0;
> +   inst.mlen = mlen;
> +}
> +
> +backend_reg
> +brw_vec4_surface_visitor::emit_untyped_atomic(
> +   backend_reg flag, backend_reg surface, backend_reg addr,
> +   backend_reg src0, backend_reg src1,
> +   unsigned dims, unsigned op) const
> +{
> +   src_reg dst = make_grf(BRW_REGISTER_TYPE_UD, 1);
> +   unsigned mlen = 0;
> +
> +   /* Set the atomic operation address. */
> +   if (v->brw->is_haswell) {
> +      emit_assign_with_pad(make_mrf(mlen), addr, dims);
> +      mlen++;
> +   } else {
> +      emit_assign_to_transpose(make_mrf(mlen), addr, dims);
> +      mlen += dims;
> +   }
> +
> +   /* Set the source arguments. */
> +   if (v->brw->is_haswell) {
> +      if (src0.file != BAD_FILE)
> +         emit(BRW_OPCODE_MOV, writemask(make_mrf(mlen), WRITEMASK_X),
> +              src0);
> +
> +      if (src1.file != BAD_FILE)
> +         emit(BRW_OPCODE_MOV, writemask(make_mrf(mlen), WRITEMASK_Y),
> +              swizzle(src1, BRW_SWIZZLE_XXXX));
> +
> +      mlen++;
>

According to Graphics BSpec: 3D-Media-GPGPU Engine > Shared Functions >
Shared Functions – Data Port [Pre-SKL] > Messages > TypedUntyped Surface
ReadWrite and TypedUntyped Atomic Operation [IVB+] > Message Payload >
SIMD4x2 Source Payload (Atomic Operation message only), "The following
atomic operations require no sources, thus the source payload is not
delivered: AOP_INC, AOP_DEC, AOP_PREDEC".

I think this means that if both src0.file and src1.file are BAD_FILE, we
shouldn't increment mlen.

A similar comment applies to emit_typed_atomic().

> +
> +   } else {
> +      if (src0.file != BAD_FILE) {
> +         emit(BRW_OPCODE_MOV, writemask(make_mrf(mlen), WRITEMASK_X),
> +              src0);
> +         mlen++;
> +      }
> +
> +      if (src1.file != BAD_FILE) {
> +         emit(BRW_OPCODE_MOV, writemask(make_mrf(mlen), WRITEMASK_X),
> +              src1);
> +         mlen++;
> +      }
> +   }
> +
> +   /* Emit the instruction.  Note that this is translated into the
> +    * SIMD8 untyped atomic message on IVB because the hardware lacks a
> +    * SIMD4x2 counterpart.
> +    */
> +   vec4_instruction &inst = exec_predicated(
> +      flag, emit(SHADER_OPCODE_UNTYPED_ATOMIC,
> +                 writemask(dst, WRITEMASK_X),
> +                 surface, op));
> +   inst.base_mrf = 0;
> +   inst.mlen = mlen;
> +
> +   return dst;
> +}
> +
> +backend_reg
> +brw_vec4_surface_visitor::emit_typed_read(
> +   backend_reg flag, backend_reg surface, backend_reg addr,
> +   unsigned dims, unsigned size) const
> +{
> +   const unsigned rlen = size * (v->brw->is_haswell ? 1 : 8);
>

I think this is 2x larger than necessary for Ivy Bridge.
brw_vec4_surface_visitor::make_grf() will divide rlen by 4 to determine how
many registers to allocate.  So, for example, if size == 4, it will
allocate 8 consecutive registers to hold the result.  But the reply from
the untyped read message will only fill up 4 of them, and the call to
emit_assign_from_transpose() will only cause 4 of them to be read.

> +backend_reg
> +brw_vec4_surface_visitor::emit_pack_generic(
> +   backend_reg src,
> +   unsigned shift_r, unsigned width_r,
> +   unsigned shift_g, unsigned width_g,
> +   unsigned shift_b, unsigned width_b,
> +   unsigned shift_a, unsigned width_a) const
> +{
> +   const unsigned mask = (!!width_r << 0 | !!width_g << 1 |
> +                          !!width_b << 2 | !!width_a << 3);
> +   const bool homogeneous = ((!width_g || width_r == width_g) &&
> +                             (!width_b || width_g == width_b) &&
> +                             (!width_a || width_b == width_a));
> +   const unsigned bits = width_r + width_g + width_b + width_a;
> +   src_reg shift = make_grf(BRW_REGISTER_TYPE_UD, 4);
> +
> +   /* Shift left to discard the most significant bits. */
> +   emit(BRW_OPCODE_MOV, writemask(shift, mask),
> +        (homogeneous ? brw_imm_ud(32 - width_r) :
> +         brw_imm_vf4(32 - width_r, 32 - width_g,
> +                     32 - width_b, 32 - width_a)));
>

It seems a little silly to go to extra effort to switch between
brw_imm_ud() and brw_imm_vc4(), given that SHL and SHR only pay attention
to the bottom 5 bits of their shift argument, and brw_imm_vf4 is capable of
representing values up to 31.  How about instead:

   emit(BRW_OPCODE_MOV, writemask(shift, mask),
        brw_imm_vf4((32 - width_r) % 32, (32 - width_g) % 32,
                    (32 - width_b) % 32, (32 - width_a) % 32));

A similar comment applies to emit_unpack_generic().

> +
> +   emit(BRW_OPCODE_SHL, writemask(src, mask), src, shift);
>

As in my comments on brw_fs_surface_visitor.cpp, I don't like writing to
src.  It seems like there's too big a risk that a user of this function
will assume that src isn't written to (and can thus be used again).  I'd
prefer to allocate a new temporary and then let register coalescing take
care of getting rid of it.

> +
> +   /* Shift right to the final bit field positions. */
> +   emit(BRW_OPCODE_MOV, writemask(shift, mask),
> +        brw_imm_vf4(32 - shift_r % 32 - width_r,
> +                    32 - shift_g % 32 - width_g,
> +                    32 - shift_b % 32 - width_b,
> +                    32 - shift_a % 32 - width_a));
>

Won't this have trouble with unused channels?  For example, if shift_a ==
width_a == 0 (i.e. because channel a doesn't exist in the format we're
packing), then we'll pass 32 to brw_imm_vf4(), which isn't allowed.

I think we need to do:

   emit(BRW_OPCODE_MOV, writemask(shift, mask),
        brw_imm_vf4((32 - shift_r % 32 - width_r) % 32,
                    (32 - shift_g % 32 - width_g) % 32,
                    (32 - shift_b % 32 - width_b) % 32,
                    (32 - shift_a % 32 - width_a) % 32));

A similar comment applies to emit_unpack_generic().

> +
> +   emit(BRW_OPCODE_SHR, writemask(src, mask), src, shift);
> +
> +   /* Add everything up. */
>

Since we're not adding, let's change the comment to "OR everything
together."

> +   if (mask >> 2)
> +      emit(BRW_OPCODE_OR,
> +           writemask(src, WRITEMASK_XY),
> +           swizzle(src, BRW_SWIZZLE_XZXZ),
> +           swizzle(src, (mask >> 3 ? BRW_SWIZZLE_YWYW :
> +                         BRW_SWIZZLE_YZYZ)));
> +
> +   if (mask >> 1 && bits <= 32)
> +      emit(BRW_OPCODE_OR,
> +           writemask(src, WRITEMASK_X),
> +           swizzle(src, BRW_SWIZZLE_XXXX),
> +           swizzle(src, BRW_SWIZZLE_YYYY));
>

This code only works if a number of assumptions about the shift and width
values are correct.  Can we document (and check) those assumptions with
assertions?  E.g. something like:

   assert(shift_r + width_r <= 32);
   if (bits <= 32) {
      // Pack all channels into result.x
      assert(!width_g || shift_g + width_g <= 32);
      assert(!width_b || shift_b + width_b <= 32);
      assert(!width_a || shift_a + width_a <= 32);
   } else if (mask >> 2) {
      // Pack rg into result.x and bw into result.y
      assert(shift_g + width_g <= 32);
      assert(width_b >= 32 && shift_b + width_b <= 64);
      assert(!width_a || (width_a >= 32 && shift_a + width_a <= 64));
   } else {
      // Pack r into result.x and g into result.y
      assert(width_g >= 32 &7 shift_g + width_g <= 64);
   }

>
> +
> +   return src;
> +}
> +
> +backend_reg
> +brw_vec4_surface_visitor::emit_unpack_generic(
> +   backend_reg src,
> +   unsigned shift_r, unsigned width_r,
> +   unsigned shift_g, unsigned width_g,
> +   unsigned shift_b, unsigned width_b,
> +   unsigned shift_a, unsigned width_a) const
> +{
> +   const unsigned mask = (!!width_r << 0 | !!width_g << 1 |
> +                          !!width_b << 2 | !!width_a << 3);
> +   const bool homogeneous = ((!width_g || width_r == width_g) &&
> +                             (!width_b || width_g == width_b) &&
> +                             (!width_a || width_b == width_a));
> +   src_reg shift = make_grf(BRW_REGISTER_TYPE_UD, 4);
> +   src_reg dst = make_grf(src.type, 4);
> +
> +   /* Shift left to discard the most significant bits. */
> +   emit(BRW_OPCODE_MOV, writemask(shift, mask),
> +        brw_imm_vf4(32 - shift_r % 32 - width_r,
> +                    32 - shift_g % 32 - width_g,
> +                    32 - shift_b % 32 - width_b,
> +                    32 - shift_a % 32 - width_a));
> +
> +   emit(BRW_OPCODE_SHL, writemask(dst, mask),
> +        swizzle(src, BRW_SWIZZLE4(shift_r / 32, shift_g / 32,
> +                                  shift_b / 32, shift_a / 32)),
> +        shift);
> +
> +   /* Shift back to the least significant bits using an arithmetic
> +    * shift to get sign extension on signed types.
> +    */
> +   emit(BRW_OPCODE_MOV, writemask(shift, mask),
> +        (homogeneous ? brw_imm_ud(32 - width_r) :
> +         brw_imm_vf4(32 - width_r, 32 - width_g,
> +                     32 - width_b, 32 - width_a)));
> +
> +   emit(BRW_OPCODE_ASR, writemask(dst, mask), dst, shift);
> +
> +   return dst;
> +}
> +
> +backend_reg
> +brw_vec4_surface_visitor::emit_pack_homogeneous(
> +   backend_reg src,
> +   unsigned shift_r, unsigned width_r,
> +   unsigned shift_g, unsigned width_g,
> +   unsigned shift_b, unsigned width_b,
> +   unsigned shift_a, unsigned width_a) const
> +{
> +   /* We could do the same with less instructions if we had some way
> +    * to use Align1 addressing in the VEC4 visitor.  Just use the
> +    * general path for now...
> +    */
> +   return emit_pack_generic(src, shift_r, width_r, shift_g, width_g,
> +                            shift_b, width_b, shift_a, width_a);
> +}
> +
> +backend_reg
> +brw_vec4_surface_visitor::emit_unpack_homogeneous(
> +   backend_reg src,
> +   unsigned shift_r, unsigned width_r,
> +   unsigned shift_g, unsigned width_g,
> +   unsigned shift_b, unsigned width_b,
> +   unsigned shift_a, unsigned width_a) const
> +{
> +   /* We could do the same with less instructions if we had some way
> +    * to use Align1 addressing in the VEC4 visitor.  Just use the
> +    * general path for now...
> +    */
> +   return emit_unpack_generic(src, shift_r, width_r, shift_g, width_g,
> +                              shift_b, width_b, shift_a, width_a);
> +}
>

It seems a little odd that for fs, the generic and homogeneous packing
functions are completely separate; but for vec4, they wind up calling the
same function, which then dynamically determines whether or not we're doing
homogeneous packing/unpacking.

How about if instead we simplify the interface so that there's just an
emit_pack() function and an emit_unpack() function.  In the vec4
implementation, we can just drop the emit_{pack,unpack}_homogeneous()
functions and rename emit_{pack,unpack}_generic() to emit_{pack,unpack}.
In the fs implementation, we can do:

brw_fs_surface_visitor::emit_{pack,unpack}(...)
{
   const bool homogeneous = ((!width_g || width_r == width_g) &&
                             (!width_b || width_g == width_b) &&
                             (!width_a || width_b == width_a));
   if (homogeneous) {
      ...homogeneous logic...
   } else {
      ...generic logic...
   }
}

> +
> +backend_reg
> +brw_vec4_surface_visitor::emit_convert_to_float(
> +   backend_reg src,
> +   unsigned mask0, unsigned width0,
> +   unsigned mask1, unsigned width1) const
>

My comments about the fs implementation of emit_convert_to_float() apply
here as well.

Final thoughts about this patch, now that I've gotten through it:

1. The patch is very large, and that made reviewing it difficult.  Please
break it out into separate patches.  One possible way of breaking it up
would be: a. add the brw_surface_visitor base class.  b. Add the
brw_fs_surface_visitor class.  c. Add the brw_vec4_surface_visitor class.
That would have the added advantage that it would lead people to review the
base class before reviewing the derived classes, which I think would have
been an easier order to review the patch in.

2. The object splicing concerns I brought up in "[PATCH 06/23] i965: Define
common register base class shared between both back-ends." apply to this
patch as well.  I will try to talk to some folks at Intel next week about
my ideas to mitigate this.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20140103/84523212/attachment-0001.html>