[Mesa-dev] [PATCH 2/2] i965/vs: Add instruction scheduling.

Mon Apr 29 18:20:20 PDT 2013

Kenneth Graunke <kenneth at whitecape.org> writes:

> On 04/23/2013 04:56 PM, Eric Anholt wrote:
>> While this doesn't have the detail that the FS scheduler does, and is
>> ignorant of dependency control, it's still good for a 0.60% +/- 0.15%
>> performance improvement on GLBenchmark 2.7 (n=45/47, outliers removed)
>> ---
>>   src/mesa/drivers/dri/i965/Makefile.sources         |    1 +
>>   src/mesa/drivers/dri/i965/brw_vec4.cpp             |    9 +
>>   src/mesa/drivers/dri/i965/brw_vec4.h               |    1 +
>>   .../drivers/dri/i965/brw_vec4_reg_allocate.cpp     |    4 +
>>   .../dri/i965/brw_vec4_schedule_instructions.cpp    |  484 ++++++++++++++++++++
>>   5 files changed, 499 insertions(+)
>>   create mode 100644 src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
>>
>> diff --git a/src/mesa/drivers/dri/i965/Makefile.sources b/src/mesa/drivers/dri/i965/Makefile.sources
>> index be8d630..a5231ee 100644
>> --- a/src/mesa/drivers/dri/i965/Makefile.sources
>> +++ b/src/mesa/drivers/dri/i965/Makefile.sources
>> @@ -86,6 +86,7 @@ i965_FILES = \
>>   	brw_vec4.cpp \
>>   	brw_vec4_copy_propagation.cpp \
>>   	brw_vec4_emit.cpp \
>> +	brw_vec4_schedule_instructions.cpp \
>>   	brw_vec4_live_variables.cpp \
>>   	brw_vec4_reg_allocate.cpp \
>>   	brw_vec4_visitor.cpp \
>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> index 0afff6f..120a4e2 100644
>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> @@ -285,6 +285,13 @@ vec4_visitor::implied_mrf_writes(vec4_instruction *inst)
>>         return 3;
>>      case SHADER_OPCODE_SHADER_TIME_ADD:
>>         return 0;
>> +   case SHADER_OPCODE_TEX:
>> +   case SHADER_OPCODE_TXL:
>> +   case SHADER_OPCODE_TXD:
>> +   case SHADER_OPCODE_TXF:
>> +   case SHADER_OPCODE_TXF_MS:
>> +   case SHADER_OPCODE_TXS:
>> +      return (inst->texture_offset || inst->header_present) ? 1 : 0;
>
> inst->header_present is actually true if inst->texture_offset != 0, 
> though I admit that's only obvious from the vec4_visitor side (and not 
> obvious when reading vec4_generator).
>
> Either way is fine.

I think it'll make more sense to me if I just write inst->header_present
then.  Sound good?

>>      default:
>>         assert(!"not reached");
>>         return inst->mlen;
>> @@ -1499,6 +1506,8 @@ vec4_visitor::run()
>>
>>      opt_set_dependency_control();
>>
>> +   opt_schedule_instructions();
>> +
>
> Setting the "don't check dependencies" flags and then reordering all the 
> instructions makes me nervous.  Maybe it's safe, but I'd rather not have 
> to try and justify it.  Can we switch the order of these passes?

It's safe at the moment, because they're all considered to be WAW deps
of each other.

I had been thinking that we wanted the scheduler to (eventually) know
about depctrl so that it could know that you can do the series of
depctrled dp4s right next to each other with no depdendencies.  But in
order to build the tracking for that, you'd be tracking the writes on a
channel basis plus maintaining the depctrl tracking, and if you do that
you could more easily just schedule them as if there's no dependency and
know that the depctrl pass immediately afterwards will set the flags as
needed.  So I think you're right about the desired ordering.

>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>> index ac3d401..7149d46 100644
>> --- a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>> @@ -102,6 +102,8 @@ brw_alloc_reg_set_for_classes(struct brw_context *brw,
>>   			      int class_count,
>>   			      int base_reg_count)
>>   {
>> +   struct intel_context *intel = &brw->intel;
>> +
>>      /* Compute the total number of registers across all classes. */
>>      int ra_reg_count = 0;
>>      for (int i = 0; i < class_count; i++) {
>> @@ -112,6 +114,8 @@ brw_alloc_reg_set_for_classes(struct brw_context *brw,
>>      brw->vs.ra_reg_to_grf = ralloc_array(brw, uint8_t, ra_reg_count);
>>      ralloc_free(brw->vs.regs);
>>      brw->vs.regs = ra_alloc_reg_set(brw, ra_reg_count);
>> +   if (intel->gen >= 6)
>> +      ra_set_allocate_round_robin(brw->vs.regs);
>
> Isn't this an unrelated change?  Both this and scheduling could impact 
> performance.

Yeah, I should separate them out.

>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp b/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
>> new file mode 100644
>> index 0000000..f1e997a
>> --- /dev/null
>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
>> @@ -0,0 +1,484 @@
>
> This new file has tabs.  Please replace them with spaces.

Thanks.

>> +    * dead code elimination anyway.
>> +    */
>> +   schedule_node *last = (schedule_node *)instructions.get_tail();
>> +   add_barrier_deps(last);
>> +
>> +   memset(last_grf_write, 0, sizeof(last_grf_write));
>> +   memset(last_mrf_write, 0, sizeof(last_mrf_write));
>> +
>> +   /* top-to-bottom dependencies: RAW and WAW. */
>> +   foreach_list(node, &instructions) {
>> +      schedule_node *n = (schedule_node *)node;
>> +      vec4_instruction *inst = n->inst;
>> +
>> +      /* read-after-write deps. */
>> +      for (int i = 0; i < 3; i++) {
>> +	 if (inst->src[i].file == GRF) {
>> +	    add_dep(last_grf_write[inst->src[i].reg], n);
>> +	 } else if (inst->src[i].file == HW_REG &&
>> +		    (inst->src[i].fixed_hw_reg.file ==
>> +		     BRW_GENERAL_REGISTER_FILE)) {
>> +	    add_dep(last_fixed_grf_write, n);
>> +	 } else if (inst->src[i].file != BAD_FILE &&
>> +		    inst->src[i].file != IMM &&
>> +		    inst->src[i].file != UNIFORM) {
>> +	    assert(inst->src[i].file != MRF);
>
> I had to ask "What about ATTR?" here.  It turns out that 
> setup_attributes() executed earlier and lowered them all away, but it 
> might be nice to have an explicit assertion or comment so people don't 
> have to ask.

Asserted.

>> +      /* write-after-write deps. */
>> +      if (inst->dst.file == GRF) {
>> +	 add_dep(last_grf_write[inst->dst.reg], n);
>> +	 last_grf_write[inst->dst.reg] = n;
>> +      } else if (inst->dst.file == MRF) {
>> +	 add_dep(last_mrf_write[inst->dst.reg], n);
>> +	 last_mrf_write[inst->dst.reg] = n;
>> +     } else if (inst->dst.file == HW_REG &&
>> +		 inst->dst.fixed_hw_reg.file == BRW_GENERAL_REGISTER_FILE) {
>> +	 last_fixed_grf_write = n;
>> +      } else if (inst->dst.file != BAD_FILE) {
>> +	 add_barrier_deps(n);
>> +      }
>> +
>> +      if (inst->mlen > 0) {
>> +	 for (int i = 0; i < v->implied_mrf_writes(inst); i++) {
>
> Eh?  Aren't you missing a line here?  Namely:
>
>              add_dep(last_mrf_write[inst->base_mrf + i], n);
>
> The FS has it, and I doubt it's superfluous.

Good catch.  Fixed.

>> +      /* Now that we've scheduled a new instruction, some of its
>> +       * children can be promoted to the list of instructions ready to
>> +       * be scheduled.  Update the children's unblocked time for this
>> +       * DAG edge as we do so.
>> +       */
>> +      for (int i = 0; i < chosen->child_count; i++) {
>> +	 schedule_node *child = chosen->children[i];
>> +
>> +	 child->unblocked_time = MAX2(child->unblocked_time,
>> +				      time + chosen->child_latency[i]);
>> +
>> +	 child->parent_count--;
>> +	 if (child->parent_count == 0) {
>> +	    instructions.push_tail(child);
>> +	 }
>> +      }
>> +
>> +      /* Shared resource: the mathbox.  There's one per EU (on later
>> +       * generations, it's even more limited pre-gen6), so if we send
>
> I actually don't think this is true, at least not on IVB...there's just 
> the ALU pipe and the EM pipe, and they're both available and can process 
> things...
>
> But I suppose that's a separate change that we ought to do in the FS as 
> well (if it's true).

Cool, that's what I feel too, once we test it.

>> +      }
>> +      sched.calculate_deps();
>> +      sched.schedule_instructions(next_block_header);
>> +   }
>> +
>> +   this->live_intervals_valid = false;
>> +}
>>
>
> A final dumb observation: a lot of the scheduler boilerplate is 
> identical between the VS and FS.  This could probably be shared by:
>
> 1. Making schedule_node use backend_instruction pointers.
> 2. Creating an instruction_scheduler base class and two subclasses:
>
>     class instruction_scheduler {
>        virtual calculate_deps() = 0;
>        ...everything else...
>     }
>
>     Since calculate_deps() actually looks at vec4/fs instructions, it 
> should probably be pure virtual and implemented by the subclasses. 
> Every other function is the same boilerplate and could be shared.

OK, it was a bunch of work, but there turned out to be more shared code
than I thought.  I'll send out a new series once I run some more tests.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130429/862db9ab/attachment.pgp>