[Mesa-dev] [PATCH 18/25] i965: Import surface lowering code.

Paul Berry stereotype441 at gmail.com
Tue Dec 31 14:35:04 PST 2013


On 2 December 2013 11:39, Francisco Jerez <currojerez at riseup.net> wrote:

> +   fs_inst &
> +   exec_predicated(backend_reg flag, fs_inst &inst)
> +   {
> +      if (flag.file != BAD_FILE) {
> +         inst.predicate = BRW_PREDICATE_NORMAL;
> +         inst.flag_subreg = flag.fixed_hw_reg.subnr / 2;
> +      }
>

Under what circumstances would flag.file == BAD_FILE?  Should we assert
that flag.file != BAD_FILE?
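
I.e. something like:

   assert(flag.file != BAD_FILE);
   inst.predicate = BRW_PREDICATE_NORMAL;
   inst.flag_subreg = flag.fixed_hw_reg.subnr / 2;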


> +/**
> + * Copy one of the halves of a SIMD16 vector to a SIMD8 vector.
> + */
> +void
> +brw_fs_surface_visitor::emit_pack_vector_half(
> +   fs_reg dst, fs_reg src,
> +   unsigned i, unsigned size) const
>

The meaning of the "i" parameter isn't immediately clear.  How about if we
rename it to "which_half"?


> +/**
> + * Copy a SIMD8 vector to one of the halves of a SIMD16 vector.
> + */
> +void
> +brw_fs_surface_visitor::emit_unpack_vector_half(
> +   fs_reg dst, fs_reg src,
> +   unsigned i, unsigned size) const
>

Same comment applies here.


> +/**
> + * Initialize the header present in some surface access messages.
> + */
> +void
> +brw_fs_surface_visitor::emit_surface_header(struct fs_reg dst) const
> +{
> +   assert(dst.file == MRF);
> +   exec_all(exec_half(0, emit(BRW_OPCODE_MOV, dst, 0)));
> +   exec_all(emit(BRW_OPCODE_MOV, brw_uvec_mrf(1, dst.reg, 7),
> +                 get_sample_mask(v)));
> +}
>

Since different messages have different header data, it would be helpful
for the comment to say which messages this function is expected to be used
for, or to include a bspec reference that shows the header format for the
messages you care about.

Also, why is it necessary to call brw_uvec_mrf()?  Isn't dst already an mrf?


> +backend_reg
> +brw_fs_surface_visitor::emit_coordinate_address_calculation(
> +   backend_reg image, backend_reg addr, unsigned dims) const
>

It would be nice to have a comment above this function clarifying that it
converts from surface coordinates to a raw memory offset, so therefore it
is only needed when accessing the hardware surface in RAW mode.
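
Something along these lines, perhaps:

/**
 * Convert a vector of surface coordinates (x, y, z) into a raw memory
 * offset in bytes.  This is only needed when the surface is accessed
 * in RAW mode, since the typed surface messages take coordinates
 * directly.
 */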


> +{
> +   fs_reg x = retype(offset(addr, 0), BRW_REGISTER_TYPE_UD);
> +   fs_reg y = retype(offset(addr, 1), BRW_REGISTER_TYPE_UD);
> +   fs_reg z = retype(offset(addr, 2), BRW_REGISTER_TYPE_UD);
> +   fs_reg offset_x = offset(image, BRW_IMAGE_PARAM_OFFSET_OFFSET + 0);
> +   fs_reg offset_y = offset(image, BRW_IMAGE_PARAM_OFFSET_OFFSET + 1);
> +   fs_reg stride_x = offset(image, BRW_IMAGE_PARAM_STRIDE_OFFSET + 0);
> +   fs_reg stride_y = offset(image, BRW_IMAGE_PARAM_STRIDE_OFFSET + 1);
> +   fs_reg stride_z = offset(image, BRW_IMAGE_PARAM_STRIDE_OFFSET + 2);
> +   fs_reg stride_w = offset(image, BRW_IMAGE_PARAM_STRIDE_OFFSET + 3);
>

It would be easier to understand if stride_z and stride_w were named
horiz_stride_z and vert_stride_z (or something similar) to reflect the
terminology used in the documentation of struct brw_image_param.


> +   fs_reg tile_x = offset(image, BRW_IMAGE_PARAM_TILING_OFFSET + 0);
> +   fs_reg tile_y = offset(image, BRW_IMAGE_PARAM_TILING_OFFSET + 1);
> +   fs_reg tile_z = offset(image, BRW_IMAGE_PARAM_TILING_OFFSET + 2);
> +   fs_reg swz_x = offset(image, BRW_IMAGE_PARAM_SWIZZLING_OFFSET + 0);
> +   fs_reg swz_y = offset(image, BRW_IMAGE_PARAM_SWIZZLING_OFFSET + 1);
> +   fs_reg high_x = make_grf(BRW_REGISTER_TYPE_UD, 1);
> +   fs_reg high_y = make_grf(BRW_REGISTER_TYPE_UD, 1);
> +   fs_reg high_z = make_grf(BRW_REGISTER_TYPE_UD, 1);
> +   fs_reg dst = make_grf(BRW_REGISTER_TYPE_UD, 1);
> +   fs_reg zero = make_grf(BRW_REGISTER_TYPE_UD, 1)
> +      .apply_stride(0);
>

It would be helpful to have a comment above this block of code explaining
that these values come from the brw_image_param struct, so people know
where to look to find more information about what they mean.
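
E.g.:

/* Surface layout parameters.  These values come from the
 * brw_image_param struct -- see its documentation for what each of
 * them means.
 */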


> +
> +   exec_all(emit(BRW_OPCODE_MOV, zero, 0));
> +
> +   /* Shift the coordinates by the fixed surface offset. */
> +   emit(BRW_OPCODE_ADD, x, x, offset_x);
>

Is it safe to write to x, y, and z in this code?  What if addr is a
variable that gets re-used later in the shader?  My preference would be to
make temporary copies for x, y, and z, and rely on register coalescing to
get rid of the copies when they aren't needed.
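
E.g. something like this at the top of the function (untested sketch):

   fs_reg x = make_grf(BRW_REGISTER_TYPE_UD, 1);
   emit(BRW_OPCODE_MOV, x, retype(offset(addr, 0), BRW_REGISTER_TYPE_UD));

and likewise for y and z, so that addr is never written to.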


> +   if (dims > 1)
> +      emit(BRW_OPCODE_ADD, y, y, offset_y);
> +
> +   if (dims > 2) {
> +      /* Decompose z into a major and a minor index. */
> +      emit(BRW_OPCODE_SHR, high_z, z, tile_z);
> +      emit(BRW_OPCODE_BFE, z, tile_z, zero, z);
> +
> +      /* Calculate the vertical slice offset. */
> +      emit(BRW_OPCODE_MUL, high_z, stride_w, high_z);
> +      emit(BRW_OPCODE_ADD, y, y, high_z);
> +
> +      /* Calculate the horizontal slice offset. */
> +      emit(BRW_OPCODE_MUL, z, stride_z, z);
> +      emit(BRW_OPCODE_ADD, x, x, z);
> +   }
> +
> +   if (dims > 1) {
>

This assumes that we don't tile 1D surfaces.  AFAICT, we do.  That seems
stupid, though.

I think we need to either:

(a) precede this patch with a patch that ensures that all 1D surfaces are
untiled, and then add a comment here explaining that we don't need to do
tile computations for 1D surfaces, or

(b) when dims == 1, set y=0 and then go ahead with the tiling computation.
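
For (b), roughly (untested):

   fs_reg y = (dims > 1 ? retype(offset(addr, 1), BRW_REGISTER_TYPE_UD) :
               make_grf(BRW_REGISTER_TYPE_UD, 1));

   if (dims == 1)
      emit(BRW_OPCODE_MOV, y, 0);

with the "dims > 1" check around the tiling computation dropped.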


> +      /* Decompose x and y into major and minor indices. */
> +      emit(BRW_OPCODE_SHR, high_x, x, tile_x);
> +      emit(BRW_OPCODE_SHR, high_y, y, tile_y);
> +
> +      emit(BRW_OPCODE_BFE, x, tile_x, zero, x);
> +      emit(BRW_OPCODE_BFE, y, tile_y, zero, y);
> +
> +      /* Calculate the pixel index from the start of the tile row.
> +       * Equivalent to:
> +       *   dst = (high_x << tile_y << tile_x) + (low_y << tile_x) + low_x
> +       */
> +      emit(BRW_OPCODE_SHL, high_x, high_x, tile_y);
> +      emit(BRW_OPCODE_ADD, dst, high_x, y);
> +      emit(BRW_OPCODE_SHL, dst, dst, tile_x);
> +      emit(BRW_OPCODE_ADD, dst, dst, x);
> +
> +      /* Multiply by the Bpp value. */
> +      emit(BRW_OPCODE_MUL, dst, dst, stride_x);
> +
> +      /* Add it to the start offset of the tile row. */
> +      emit(BRW_OPCODE_MUL, high_y, stride_y, high_y);
> +      emit(BRW_OPCODE_SHL, high_y, high_y, tile_y);
> +      emit(BRW_OPCODE_ADD, dst, dst, high_y);
> +
> +      if (v->brw->has_swizzling) {
> +         fs_reg bit_x = make_grf(BRW_REGISTER_TYPE_UD, 1);
> +         fs_reg bit_y = make_grf(BRW_REGISTER_TYPE_UD, 1);
> +
> +         /* Take into account the two dynamically specified shifts. */
> +         emit(BRW_OPCODE_SHR, bit_x, dst, swz_x);
> +         emit(BRW_OPCODE_SHR, bit_y, dst, swz_y);
>

It would be nice to have a comment here explaining that if swz_x or swz_y
is 0xff, this will shift dst to the right by 31 bits, which is guaranteed
to produce a 0 (since no image occupies more than 2G of memory).
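
Perhaps something like:

/* Since only the low five bits of the shift count are used, a
 * swz_x/swz_y value of 0xff shifts the address right by 31 bits,
 * which is guaranteed to give zero (no image occupies more than 2G
 * of memory) and so effectively disables swizzling.
 */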


> +
> +         /* XOR bit_x and bit_y with bit 6 of the memory address. */
> +         emit(BRW_OPCODE_XOR, bit_x, bit_x, bit_y);
> +         emit(BRW_OPCODE_AND, bit_x, bit_x, 1 << 6);
> +         emit(BRW_OPCODE_XOR, dst, dst, bit_x);
> +      }
> +
> +   } else {
> +      /* Multiply by the Bpp value. */
> +      emit(BRW_OPCODE_MUL, dst, x, stride_x);
> +   }
> +
> +   return dst;
> +}
> +
> +backend_reg
> +brw_fs_surface_visitor::emit_typed_read(
> +   backend_reg flag, backend_reg surface, backend_reg addr,
> +   unsigned dims, unsigned size) const
> +{
> +   fs_reg dst = make_grf(BRW_REGISTER_TYPE_UD, size);
> +   const unsigned w = v->dispatch_width / 8;
> +
> +   for (unsigned i = 0; i < w; ++i) {
>

It would be helpful to have a short comment just above the for loop
explaining that for SIMD16, we're splitting the read into two SIMD8 reads
because SIMD16 typed read is not supported.  A similar comment applies to
the emit_typed_write() and emit_typed_atomic() functions.
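
E.g.:

/* SIMD16 typed surface reads are not supported by the hardware, so
 * in SIMD16 mode split the access into two SIMD8 messages, one for
 * each half of the dispatch.
 */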



> +backend_reg
> +brw_fs_surface_visitor::emit_pad(
> +   backend_reg flag, backend_reg src, unsigned size) const
>

This function needs a comment explaining what it does and why it's
necessary.


> +backend_reg
> +brw_fs_surface_visitor::emit_pack_generic(
> +   backend_reg src,
> +   unsigned shift_r, unsigned width_r,
> +   unsigned shift_g, unsigned width_g,
> +   unsigned shift_b, unsigned width_b,
> +   unsigned shift_a, unsigned width_a) const
>

This function could also use a comment explaining what it does.  In
particular, it should be clear from looking at the comments above this
function and emit_pack_homogeneous() how the two functions differ.
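
Based on my reading of the code, maybe:

/**
 * Pack up to four vector components into a bit field, using an
 * arbitrary bit shift and width for each component.
 */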


> +{
> +   const unsigned shift[] = { shift_r, shift_g, shift_b, shift_a };
> +   const unsigned width[] = { width_r, width_g, width_b, width_a };
> +   const unsigned bits = width_r + width_g + width_b + width_a;
> +   fs_reg dst = make_grf(BRW_REGISTER_TYPE_UD, bits / 32);
> +   bool seen[4] = {};
>

If bits > 128, access to seen[] will overflow.  How about if we add:

assert(bits <= 128);

Also, "dword_written" would be a more descriptive name than "seen", and
would be consistent with the terminology you use in comments below.  A
similar comment applies to emit_pack_homogeneous().


> +
> +   for (unsigned i = 0; i < Elements(width); ++i) {
> +      if (width[i]) {
> +         const unsigned j = shift[i] / 32;
> +         const unsigned k = shift[i] % 32;
> +         const unsigned m = (1ull << width[i]) - 1;
>

The single-letter variable names make this code really hard to follow.  How
about "dw_offset", "bit_offset", and "bit_mask"?  A similar comment applies
to emit_pack_homogeneous().


> +         fs_reg tmp = make_grf(BRW_REGISTER_TYPE_UD, 1);
> +
> +         if (seen[j]) {
> +            /* Insert the source value into the bit field if we have
> +             * already written to this dword.
> +             */
> +            emit(BRW_OPCODE_MOV, tmp, m << k);
> +            emit(BRW_OPCODE_BFI2, offset(dst, j),
> +                 tmp, offset(src, i), offset(dst, j));
> +
> +         } else {
> +            /* Otherwise just mask and copy the value over. */
> +            emit(BRW_OPCODE_AND, offset(dst, j),
> +                 offset(src, i), m);
> +
> +            if (k)
> +               emit(BRW_OPCODE_SHL, offset(dst, j),
> +                    offset(dst, j), k);
> +
> +            seen[j] = true;
> +         }
> +      }
> +   }
> +
> +   return dst;
> +}
> +
> +backend_reg
> +brw_fs_surface_visitor::emit_unpack_generic(
> +   backend_reg src,
> +   unsigned shift_r, unsigned width_r,
> +   unsigned shift_g, unsigned width_g,
> +   unsigned shift_b, unsigned width_b,
> +   unsigned shift_a, unsigned width_a) const
>

This function also needs a comment explaining what it does.
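
Again based on my reading, perhaps:

/**
 * Unpack up to four vector components from a bit field, shifting each
 * component down to the LSB and sign-extending it on signed types.
 */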


> +{
> +   const unsigned shift[] = { shift_r, shift_g, shift_b, shift_a };
> +   const unsigned width[] = { width_r, width_g, width_b, width_a };
> +   const unsigned n = !!width_r + !!width_g + !!width_b + !!width_a;
>

How about "num_components" instead of "n"?

>
> +   fs_reg dst = make_grf(src.type, n);
> +
> +   for (unsigned i = 0; i < Elements(width); ++i) {
> +      if (width[i]) {
>

I'm also concerned because certain combinations of widths will cause this
function to emit code that overflows the output register.  For example, if
width_r == width_g == width_b == 0 and width_a is nonzero, then n will be
1, but the "a" component will get unpacked to offset(dst, 3).  How about if
we add the assertion:

assert(i < n);


> +         /* Discard the most significant bits. */
> +         emit(BRW_OPCODE_SHL, offset(dst, i),
> +              offset(src, shift[i] / 32),
> +              32 - shift[i] % 32 - width[i]);
> +
> +         /* Shift it back to the least significant bits using an
> +          * arithmetic shift to get sign extension on signed types.
> +          */
> +         emit(BRW_OPCODE_ASR, offset(dst, i),
> +              offset(dst, i), 32 - width[i]);
> +      }
> +   }
> +
> +   return dst;
> +}
> +
> +namespace {
> +   unsigned
> +   type_for_width(unsigned width)
> +   {
> +      switch (width) {
> +      case 8:
> +         return BRW_REGISTER_TYPE_UB;
> +      case 16:
> +         return BRW_REGISTER_TYPE_UW;
> +      case 32:
> +         return BRW_REGISTER_TYPE_UD;
> +      default:
> +         unreachable();
>

Change to an assert.
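
I.e.:

      default:
         assert(!"Unsupported bit width");
         return BRW_REGISTER_TYPE_UD;

so that release builds still return something well-defined.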


> +      }
> +   }
> +}
> +
> +backend_reg
> +brw_fs_surface_visitor::emit_pack_homogeneous(
> +   backend_reg src,
> +   unsigned shift_r, unsigned width_r,
> +   unsigned shift_g, unsigned width_g,
> +   unsigned shift_b, unsigned width_b,
> +   unsigned shift_a, unsigned width_a) const
>

This function also needs a comment explaining what it does.  In particular,
is it functionally equivalent to emit_pack_generic() but optimized based
on certain assumptions about the shifts and widths?  Or is it intended to
do something different?  emit_unpack_homogeneous() needs a similar comment.


> +{
> +   const unsigned shift[] = { shift_r, shift_g, shift_b, shift_a };
> +   const unsigned width[] = { width_r, width_g, width_b, width_a };
> +   const unsigned type = type_for_width(width[0]);
> +   fs_reg dst = make_grf(BRW_REGISTER_TYPE_UD, type_sz(type));
> +   fs_reg csrc = retype(fs_reg(src), type).apply_stride(4 / type_sz(type));
> +   fs_reg cdst = retype(dst, type).apply_stride(4 / type_sz(type));
> +   bool seen[4] = {};
> +
> +   for (unsigned i = 0; i < 4; ++i) {
> +      if (width[i]) {
>

It looks like this code relies on the widths and shifts satisfying a few
invariants.  Can we add these assertions to clarify that?

assert(width[i] == width[0]);
assert(shift[i] % 8 == 0);

Similar assertions should be added to emit_unpack_homogeneous().


> +         const unsigned j = shift[i] / 32;
> +         const unsigned k = shift[i] % 32;
> +
> +         if (seen[j]) {
> +            /* Insert the source value into the bit field if we have
> +             * already written to this dword.
> +             */
> +            emit(BRW_OPCODE_MOV, offset(byte_offset(cdst, k / 8), j),
> +                 offset(csrc, i));
> +
> +         } else {
> +            /* Otherwise overwrite the whole dword to make sure that
> +             * unused fields are initialized to zero.
> +             */
> +            emit(BRW_OPCODE_SHL, offset(dst, j),
> +                 offset(csrc, i), k);
> +
> +            seen[j] = true;
> +         }
> +      }
> +   }
> +
> +   return dst;
> +}
> +
> +backend_reg
> +brw_fs_surface_visitor::emit_unpack_homogeneous(
> +   backend_reg src,
> +   unsigned shift_r, unsigned width_r,
> +   unsigned shift_g, unsigned width_g,
> +   unsigned shift_b, unsigned width_b,
> +   unsigned shift_a, unsigned width_a) const
> +{
> +   const unsigned shift[] = { shift_r, shift_g, shift_b, shift_a };
> +   const unsigned width[] = { width_r, width_g, width_b, width_a };
> +   const unsigned type = type_for_width(width[0]);
> +   fs_reg tmp = retype(fs_reg(src), type).apply_stride(4 / type_sz(type));
> +   fs_reg dst = make_grf(src.type, 4);
>

Why does this use a size of 4 rather than the number of nonzero widths, as
emit_unpack_generic() does?


> +
> +   for (unsigned i = 0; i < 4; ++i) {
> +      if (width[i])
> +         emit(BRW_OPCODE_MOV,
> +              offset(dst, i),
> +              offset(byte_offset(tmp, shift[i] % 32 / 8), shift[i] / 32));
>

Won't this do the wrong thing for signed types?  It seems like you would
need type_for_width to return a signed type to get the proper sign
extension.
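
E.g. a signedness-aware variant (hypothetical):

   unsigned
   type_for_width(unsigned width, bool is_signed)
   {
      switch (width) {
      case 8:
         return is_signed ? BRW_REGISTER_TYPE_B : BRW_REGISTER_TYPE_UB;
      case 16:
         return is_signed ? BRW_REGISTER_TYPE_W : BRW_REGISTER_TYPE_UW;
      case 32:
         return is_signed ? BRW_REGISTER_TYPE_D : BRW_REGISTER_TYPE_UD;
      default:
         assert(!"Unsupported bit width");
         return BRW_REGISTER_TYPE_UD;
      }
   }

with the caller passing in something like type_is_signed(src.type).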


> +   }
> +
> +   return dst;
> +}
> +
> +backend_reg
> +brw_fs_surface_visitor::emit_convert_from_scaled(
> +   backend_reg src,
> +   unsigned mask0, float scale0,
> +   unsigned mask1, float scale1) const
> +{
> +   const unsigned mask[] = { mask0, mask1 };
> +   const float scale[] = { scale0, scale1 };
> +   fs_reg dst = retype(src, BRW_REGISTER_TYPE_F);
> +
> +   for (unsigned i = 0; i < Elements(mask); ++i) {
> +      for (unsigned j = 0; j < 4; ++j) {
> +         if (mask[i] & (1 << j)) {
> +            /* Convert to float and divide by the normalization
> +             * constant.
> +             */
> +            emit(BRW_OPCODE_MOV, offset(dst, j), offset(src, j));
> +            emit(BRW_OPCODE_MUL, offset(dst, j), offset(dst, j),
> +                    fs_reg(1.0f / scale[i]));
> +
> +            /* Clamp to the minimum value. */
> +            if (type_is_signed(src.type))
> +               emit(BRW_OPCODE_SEL, offset(dst, j),
> +                    offset(dst, j), -1.0f)
> +                  .conditional_mod = BRW_CONDITIONAL_G;
>

It would be nice to have a short comment explaining why it's not necessary
to clamp to +1.0.
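
If I'm following the arithmetic correctly, it's because the largest
representable source value is scale[i] itself, which divides to exactly
1.0, and only the most negative two's complement value falls below
-1.0.  So perhaps:

/* No upper clamp is needed: the largest representable source value
 * divided by the normalization constant is exactly 1.0.
 */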

> +         }
> +      }
> +   }
> +
> +   return dst;
> +}
> +
> +backend_reg
> +brw_fs_surface_visitor::emit_convert_to_scaled(
> +   backend_reg src, unsigned type,
> +   unsigned mask0, float scale0,
> +   unsigned mask1, float scale1) const
> +{
> +   const unsigned mask[] = { mask0, mask1 };
> +   const float scale[] = { scale0, scale1 };
> +   fs_reg dst = retype(src, type);
> +
> +   for (unsigned i = 0; i < Elements(mask); ++i) {
> +      for (unsigned j = 0; j < 4; ++j) {
> +         if (mask[i] & (1 << j)) {
> +            /* Clamp to the minimum value. */
> +            if (type_is_signed(type))
> +               emit(BRW_OPCODE_SEL, offset(src, j),
> +                    offset(src, j), -1.0f)
> +                  .conditional_mod = BRW_CONDITIONAL_G;
>

I'm not comfortable with the fact that a lot of these conversion functions
write to their source registers.  It's safe now, since you only call the
conversion functions from contexts where the source is never going to be
used again, but if someone comes back and modifies the code in the future,
it's not going to be obvious that they have to preserve that invariant.
I'd rather we used temporary registers.  I think register coalescing and
register allocation are smart enough that there won't be a penalty on the
final optimized code.
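
E.g. at the top of emit_convert_to_scaled(), something like (untested):

   /* Operate on a temporary copy so that the caller's source register
    * is left intact.
    */
   fs_reg tmp = make_grf(BRW_REGISTER_TYPE_F, 4);

   for (unsigned j = 0; j < 4; ++j)
      emit(BRW_OPCODE_MOV, offset(tmp, j), offset(src, j));

with the clamping and scaling code below reading and writing tmp
instead of src.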


> +
> +            /* Clamp to the maximum value. */
> +            emit(BRW_OPCODE_SEL, offset(src, j),
> +                 offset(src, j), 1.0f)
> +               .conditional_mod = BRW_CONDITIONAL_L;
> +
> +            /* Multiply by the normalization constant and convert to
> +             * integer.
> +             */
> +            emit(BRW_OPCODE_MUL, offset(src, j), offset(src, j),
> +                    scale[i]);
> +            emit(BRW_OPCODE_MOV, offset(dst, j), offset(src, j));
> +         }
> +      }
> +   }
> +
> +   return dst;
> +}
> +
> +backend_reg
> +brw_fs_surface_visitor::emit_convert_from_float(
> +   backend_reg src,
> +   unsigned mask0, unsigned width0,
> +   unsigned mask1, unsigned width1) const
> +{
> +   const unsigned mask[] = { mask0, mask1 };
> +   const unsigned width[] = { width0, width1 };
> +   fs_reg dst = retype(src, BRW_REGISTER_TYPE_F);
> +
> +   for (unsigned i = 0; i < Elements(mask); ++i) {
> +      for (unsigned j = 0; j < 4; ++j) {
> +         if (mask[i] & (1 << j)) {
> +            /* Extend 10-bit and 11-bit floating point numbers to 15
> +             * bits.  This works because they have a 5-bit exponent
> +             * just like the 16-bit floating point format, and they
> +             * have no sign bit.
> +             */
> +            if (width[i] < 16)
> +               emit(BRW_OPCODE_SHL, offset(src, j),
> +                       offset(src, j), 15 - width[i]);
> +
> +            /* Convert to a 32-bit float. */
> +            emit(BRW_OPCODE_F16TO32, offset(dst, j), offset(src, j));
> +         }
> +      }
> +   }
> +
> +   return dst;
> +}
> +
> +backend_reg
> +brw_fs_surface_visitor::emit_convert_to_float(
> +   backend_reg src,
> +   unsigned mask0, unsigned width0,
> +   unsigned mask1, unsigned width1) const
> +{
> +   const unsigned mask[] = { mask0, mask1 };
> +   const unsigned width[] = { width0, width1 };
> +   fs_reg dst = retype(src, BRW_REGISTER_TYPE_UD);
> +
> +   for (unsigned i = 0; i < Elements(mask); ++i) {
> +      for (unsigned j = 0; j < 4; ++j) {
> +         if (mask[i] & (1 << j)) {
> +            /* Clamp to the minimum value. */
> +            if (width[i] < 16)
> +               emit(BRW_OPCODE_SEL, offset(src, j),
> +                       offset(src, j), 0.0f)
> +                  .conditional_mod = BRW_CONDITIONAL_G;
> +
> +            /* Convert to a 16-bit float. */
> +            emit(BRW_OPCODE_F32TO16, offset(dst, j), offset(src, j));
> +
> +            /* Discard the least significant bits to get a floating
> +             * point number of the requested width.  This works
> +             * because the 10-bit and 11-bit floating point formats
> +             * have a 5-bit exponent just like the 16-bit format, and
> +             * they have no sign bit.
> +             */
> +            if (width[i] < 16)
> +               v->emit(BRW_OPCODE_SHR, offset(dst, j),
> +                       offset(dst, j), 15 - width[i]);
>

There's a corner case with NaN, and I can't decide whether it's important
enough to worry about.  Technically, any value with an exponent of 0b11111
and a nonzero mantissa is NaN.  If the mantissa is too small, truncating
its lower bits will change it to Inf.  Obviously, dealing with this corner
case would require emitting more instructions, so I don't know if it's
worth it.

Also, I wonder if we should try to implement proper round-to-nearest-even
behaviour here.  The best way I can think of to implement it is 4
instructions:

MOV dst.F dst.UD
MUL dst.F dst.F (1.0/(1 << (15 - width[i])))
RNDE dst.F dst.F
MOV dst.UD dst.F

Note, however, that this has a different NaN corner case problem: if the
mantissa is too large, the RNDE will overflow.

>
> +         }
> +      }
> +   }
> +
> +   return dst;
> +}
>

I'll send more comments in a separate email.