[Mesa-dev] [PATCH v3 2/3] i965: Introduce a MOV_INDIRECT opcode.

Sat Nov 14 12:35:05 PST 2015

On Fri, Nov 13, 2015 at 6:50 PM, Kenneth Graunke <kenneth at whitecape.org> wrote:
> The geometry and tessellation control shader stages both read from
> multiple URB entries (one per vertex).  The thread payload contains
> several URB handles which reference these separate memory segments.
>
> In GLSL, these inputs are represented as per-vertex arrays; the
> outermost array index selects which vertex's inputs to read.  This
> array index does not necessarily need to be constant.
>
> To handle that, we need to use indirect addressing on GRFs to select
> which of the thread payload registers has the appropriate URB handle.
> (This is before we can even think about applying the pull model!)
>
> This patch introduces a new opcode which performs a MOV from a
> source using VxH indirect addressing (which allows each of the 8
> SIMD channels to select distinct data.)
>
> Based on a patch by Jason Ekstrand.
>
> v2: Rename from INDIRECT_THREAD_PAYLOAD_MOV to MOV_INDIRECT; make it
>     a bit more generic.  Use regs_read() instead of hacking up the
>     register allocator.  (Suggested by Jason Ekstrand.)
>
> v3: Fix regs_read() to be more accurate for small unaligned regions.
>     Also rebase on Matt's work.
>
> Signed-off-by: Kenneth Graunke <kenneth at whitecape.org>
> Reviewed-by: Abdiel Janulgue <abdiel.janulgue at linux.intel.com> [v1]
>
> stash
> ---
>  src/mesa/drivers/dri/i965/brw_defines.h        | 10 ++++++++
>  src/mesa/drivers/dri/i965/brw_fs.cpp           | 28 +++++++++++++++++++++
>  src/mesa/drivers/dri/i965/brw_fs.h             |  5 ++++
>  src/mesa/drivers/dri/i965/brw_fs_cse.cpp       |  1 +
>  src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 34 ++++++++++++++++++++++++++
>  src/mesa/drivers/dri/i965/brw_shader.cpp       |  2 ++
>  6 files changed, 80 insertions(+)
>
> Here's an updated version.  I changed the description of source 2 to be
> "the length of the region that could be accessed, in bytes" rather than
> "the maximum value of the indirect offset" (aka the largest possible value
> of source 1), as that's what I actually wanted - it's much easier to use.
>
> (Notably, the largest possible value of the indirect offset means that
> it affects the maximum possible base of the read...but you still read
> type_sz() bytes beyond that...which is ugly.  Size of the region is
> much easier to use, as base + size = end...and you don't read beyond
> the end.)
>
> This is all getting into dire "but isn't it off by 1?" territory, so
> here are some examples demonstrating how one might use this:
>
> "I want to read some .xyzw component of a vec4 stored in r7.4."
>
>     src[0] = r7.4<v;w,h>F
>     src[1] = <indirect offset>
>     src[2] = 16 bytes
>
>     Because subnr != 0, we calculate subnr * type_sz, and get:
>
>          subnr * sizeof(float) = 4 * 4 = 16 bytes
>
>     We add this to region_length to get a total size of 32 bytes.
>     DIV_ROUND_UP(32, REG_SIZE) = 32 / 32 = 1 register read.
>
>     Correct, because we're only reading the last 4 channels of r7,
>     so our region is contained within 1 register.
>
> "I want to read up to 6 UD channels starting at r7.7."
>
>     src[0] = r7.1<v;w,h>UD
>     src[1] = <indirect offset>
>     src[2] = 6 * sizeof(UD) = 24 bytes
>
>     Because subnr != 0, we calculate subnr * type_sz, and get:
>
>          subnr * sizeof(UD) = 7 * 4 = 28 bytes.
>
>     We add this to region_length to get a total of 24 + 28 = 52 bytes.
>     DIV_ROUND_UP(52, REG_SIZE) = ceil(52 / 32) = 2
>
>     Correct, because we read part of r7 and part of r8.
>
> "I want to get the ICP handles for vertex 2 out of 6 vertices."
>
>     Let's assume the first ICP handle starts at r3.
>
>     The ICP handles for vertex N take up whole register, so we would pick
>     src[0] = r(first icp handle)<8,8,1>UD
>     src[1] = <7,6,5,4,3,2,1> * 4 + 2 * REG_SIZE
>     src[2] = 6 vertices * REG_SIZE = 192 bytes
>
>     subnr == 0, so we don't add anything; region_length stays as 192.
>     DIV_ROUND_UP(192, 32) = 6.  We potentially read r3, r4, r5, r6, r7, r8,
>     which contain the ICP handles for vertex 0, 1, 2, 3, 4, 5 - 6 vertices.
>
> So I think this will work.

Yeah, this all seems totally reasonable.

Reviewed-by: Jason Ekstrand <jason.ekstrand at intel.com>

Thanks for taking the time to get this right!
--Jason

> diff --git a/src/mesa/drivers/dri/i965/brw_defines.h b/src/mesa/drivers/dri/i965/brw_defines.h
> index 6484484..0b8de63 100644
> --- a/src/mesa/drivers/dri/i965/brw_defines.h
> +++ b/src/mesa/drivers/dri/i965/brw_defines.h
> @@ -1289,6 +1289,16 @@ enum opcode {
>      * Calculate the high 32-bits of a 32x32 multiply.
>      */
>     SHADER_OPCODE_MULH,
> +
> +   /**
> +    * A MOV that uses VxH indirect addressing.
> +    *
> +    * Source 0: A register to start from (HW_REG).
> +    * Source 1: An indirect offset (in bytes, UD GRF).
> +    * Source 2: The length of the region that could be accessed (in bytes,
> +    *           UD immediate).
> +    */
> +   SHADER_OPCODE_MOV_INDIRECT,
>  };
>
>  enum brw_urb_write_flags {
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
> index 80b8c8e..84b5920 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> @@ -840,6 +840,34 @@ fs_inst::regs_read(int arg) const
>     case SHADER_OPCODE_BARRIER:
>        return 1;
>
> +   case SHADER_OPCODE_MOV_INDIRECT:
> +      if (arg == 0) {
> +         assert(src[2].file == IMM);
> +         unsigned region_length = src[2].ud;
> +
> +         if (src[0].file == FIXED_GRF) {
> +            /* If the start of the region is not register aligned, then
> +             * there's some portion of the register that's technically
> +             * unread at the beginning.
> +             *
> +             * However, the register allocator works in terms of whole
> +             * registers, and does not use subnr.  It assumes that the
> +             * read starts at the beginning of the register, and extends
> +             * regs_read() whole registers beyond that.
> +             *
> +             * To compensate, we extend the region length to include this
> +             * unread portion at the beginning.
> +             */
> +            if (src[0].subnr)
> +               region_length += src[0].subnr * type_sz(src[0].type);
> +
> +            return DIV_ROUND_UP(region_length, REG_SIZE);
> +         } else {
> +            assert(!"Invalid register file");
> +         }
> +      }
> +      break;
> +
>     default:
>        if (is_tex() && arg == 0 && src[0].file == VGRF)
>           return mlen;
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
> index f40e58b..cbfc07f 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.h
> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
> @@ -527,6 +527,11 @@ private:
>                                   struct brw_reg offset,
>                                   struct brw_reg value);
>
> +   void generate_mov_indirect(fs_inst *inst,
> +                              struct brw_reg dst,
> +                              struct brw_reg reg,
> +                              struct brw_reg indirect_byte_offset);
> +
>     bool patch_discard_jumps_to_fb_writes();
>
>     const struct brw_compiler *compiler;
> diff --git a/src/mesa/drivers/dri/i965/brw_fs_cse.cpp b/src/mesa/drivers/dri/i965/brw_fs_cse.cpp
> index 8c67caf..3c40fcd 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs_cse.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs_cse.cpp
> @@ -78,6 +78,7 @@ is_expression(const fs_visitor *v, const fs_inst *const inst)
>     case FS_OPCODE_LINTERP:
>     case SHADER_OPCODE_FIND_LIVE_CHANNEL:
>     case SHADER_OPCODE_BROADCAST:
> +   case SHADER_OPCODE_MOV_INDIRECT:
>        return true;
>     case SHADER_OPCODE_RCP:
>     case SHADER_OPCODE_RSQ:
> diff --git a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> index 139cda3..e5a286a 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> @@ -372,6 +372,36 @@ fs_generator::generate_fb_write(fs_inst *inst, struct brw_reg payload)
>  }
>
>  void
> +fs_generator::generate_mov_indirect(fs_inst *inst,
> +                                    struct brw_reg dst,
> +                                    struct brw_reg reg,
> +                                    struct brw_reg indirect_byte_offset)
> +{
> +   assert(indirect_byte_offset.type == BRW_REGISTER_TYPE_UD);
> +   assert(indirect_byte_offset.file == BRW_GENERAL_REGISTER_FILE);
> +
> +   unsigned imm_byte_offset = reg.nr * REG_SIZE + reg.subnr;
> +
> +   /* We use VxH indirect addressing, clobbering a0.0 through a0.7. */
> +   struct brw_reg addr = vec8(brw_address_reg(0));
> +
> +   /* The destination stride of an instruction (in bytes) must be greater
> +    * than or equal to the size of the rest of the instruction.  Since the
> +    * address register is of type UW, we can't use a D-type instruction.
> +    * In order to get around this, re re-type to UW and use a stride.
> +    */
> +   indirect_byte_offset =
> +      retype(spread(indirect_byte_offset, 2), BRW_REGISTER_TYPE_UW);
> +
> +   /* Prior to Broadwell, there are only 8 address registers. */
> +   assert(inst->exec_size == 8 || devinfo->gen >= 8);
> +
> +   brw_MOV(p, addr, indirect_byte_offset);
> +   brw_inst_set_mask_control(devinfo, brw_last_inst, BRW_MASK_DISABLE);
> +   brw_MOV(p, dst, retype(brw_VxH_indirect(0, imm_byte_offset), dst.type));
> +}
> +
> +void
>  fs_generator::generate_urb_read(fs_inst *inst,
>                                  struct brw_reg dst,
>                                  struct brw_reg header)
> @@ -2079,6 +2109,10 @@ fs_generator::generate_code(const cfg_t *cfg, int dispatch_width)
>           fill_count++;
>          break;
>
> +      case SHADER_OPCODE_MOV_INDIRECT:
> +         generate_mov_indirect(inst, dst, src[0], src[1]);
> +         break;
> +
>        case SHADER_OPCODE_URB_READ_SIMD8:
>        case SHADER_OPCODE_URB_READ_SIMD8_PER_SLOT:
>           generate_urb_read(inst, dst, src[0]);
> diff --git a/src/mesa/drivers/dri/i965/brw_shader.cpp b/src/mesa/drivers/dri/i965/brw_shader.cpp
> index a438e18..50b288b 100644
> --- a/src/mesa/drivers/dri/i965/brw_shader.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_shader.cpp
> @@ -554,6 +554,8 @@ brw_instruction_name(enum opcode op)
>        return "barrier";
>     case SHADER_OPCODE_MULH:
>        return "mulh";
> +   case SHADER_OPCODE_MOV_INDIRECT:
> +      return "mov_indirect";
>     }
>
>     unreachable("not reached");
> --
> 2.6.2
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev