[Mesa-dev] [PATCH 2/2] i965/vs: Add instruction scheduling.

Mon Apr 29 20:07:42 PDT 2013

On 04/29/2013 06:20 PM, Eric Anholt wrote:
> Kenneth Graunke <kenneth at whitecape.org> writes:
>
>> On 04/23/2013 04:56 PM, Eric Anholt wrote:
>>> While this doesn't have the detail that the FS scheduler does, and is
>>> ignorant of dependency control, it's still good for a 0.60% +/- 0.15%
>>> performance improvement on GLBenchmark 2.7 (n=45/47, outliers removed)
>>> ---
>>>    src/mesa/drivers/dri/i965/Makefile.sources         |    1 +
>>>    src/mesa/drivers/dri/i965/brw_vec4.cpp             |    9 +
>>>    src/mesa/drivers/dri/i965/brw_vec4.h               |    1 +
>>>    .../drivers/dri/i965/brw_vec4_reg_allocate.cpp     |    4 +
>>>    .../dri/i965/brw_vec4_schedule_instructions.cpp    |  484 ++++++++++++++++++++
>>>    5 files changed, 499 insertions(+)
>>>    create mode 100644 src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/Makefile.sources b/src/mesa/drivers/dri/i965/Makefile.sources
>>> index be8d630..a5231ee 100644
>>> --- a/src/mesa/drivers/dri/i965/Makefile.sources
>>> +++ b/src/mesa/drivers/dri/i965/Makefile.sources
>>> @@ -86,6 +86,7 @@ i965_FILES = \
>>>    	brw_vec4.cpp \
>>>    	brw_vec4_copy_propagation.cpp \
>>>    	brw_vec4_emit.cpp \
>>> +	brw_vec4_schedule_instructions.cpp \
>>>    	brw_vec4_live_variables.cpp \
>>>    	brw_vec4_reg_allocate.cpp \
>>>    	brw_vec4_visitor.cpp \
>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> index 0afff6f..120a4e2 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> @@ -285,6 +285,13 @@ vec4_visitor::implied_mrf_writes(vec4_instruction *inst)
>>>          return 3;
>>>       case SHADER_OPCODE_SHADER_TIME_ADD:
>>>          return 0;
>>> +   case SHADER_OPCODE_TEX:
>>> +   case SHADER_OPCODE_TXL:
>>> +   case SHADER_OPCODE_TXD:
>>> +   case SHADER_OPCODE_TXF:
>>> +   case SHADER_OPCODE_TXF_MS:
>>> +   case SHADER_OPCODE_TXS:
>>> +      return (inst->texture_offset || inst->header_present) ? 1 : 0;
>>
>> inst->header_present is actually true if inst->texture_offset != 0,
>> though I admit that's only obvious from the vec4_visitor side (and not
>> obvious when reading vec4_generator).
>>
>> Either way is fine.
>
> I think it'll make more sense to me if I just write inst->header_present
> then.  Sound good?

Yep, that sounds best.

>>>       default:
>>>          assert(!"not reached");
>>>          return inst->mlen;
>>> @@ -1499,6 +1506,8 @@ vec4_visitor::run()
>>>
>>>       opt_set_dependency_control();
>>>
>>> +   opt_schedule_instructions();
>>> +
>>
>> Setting the "don't check dependencies" flags and then reordering all the
>> instructions makes me nervous.  Maybe it's safe, but I'd rather not have
>> to try and justify it.  Can we switch the order of these passes?
>
> It's safe at the moment, because they're all considered to be WAW deps
> of each other.
>
> I had been thinking that we wanted the scheduler to (eventually) know
> about depctrl so that it could know that you can do the series of
> depctrled dp4s right next to each other with no depdendencies.  But in
> order to build the tracking for that, you'd be tracking the writes on a
> channel basis plus maintaining the depctrl tracking, and if you do that
> you could more easily just schedule them as if there's no dependency and
> know that the depctrl pass immediately afterwards will set the flags as
> needed.  So I think you're right about the desired ordering.

Yeah, that makes sense to me - we just want per-channel scheduling. 
opt_set_dependency_control() basically makes that scheduling work out.

>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>>> index ac3d401..7149d46 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>>> @@ -102,6 +102,8 @@ brw_alloc_reg_set_for_classes(struct brw_context *brw,
>>>    			      int class_count,
>>>    			      int base_reg_count)
>>>    {
>>> +   struct intel_context *intel = &brw->intel;
>>> +
>>>       /* Compute the total number of registers across all classes. */
>>>       int ra_reg_count = 0;
>>>       for (int i = 0; i < class_count; i++) {
>>> @@ -112,6 +114,8 @@ brw_alloc_reg_set_for_classes(struct brw_context *brw,
>>>       brw->vs.ra_reg_to_grf = ralloc_array(brw, uint8_t, ra_reg_count);
>>>       ralloc_free(brw->vs.regs);
>>>       brw->vs.regs = ra_alloc_reg_set(brw, ra_reg_count);
>>> +   if (intel->gen >= 6)
>>> +      ra_set_allocate_round_robin(brw->vs.regs);
>>
>> Isn't this an unrelated change?  Both this and scheduling could impact
>> performance.
>
> Yeah, I should separate them out.
>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp b/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
>>> new file mode 100644
>>> index 0000000..f1e997a
>>> --- /dev/null
>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
>>> @@ -0,0 +1,484 @@
>>
>> This new file has tabs.  Please replace them with spaces.
>
> Thanks.
>
>>> +    * dead code elimination anyway.
>>> +    */
>>> +   schedule_node *last = (schedule_node *)instructions.get_tail();
>>> +   add_barrier_deps(last);
>>> +
>>> +   memset(last_grf_write, 0, sizeof(last_grf_write));
>>> +   memset(last_mrf_write, 0, sizeof(last_mrf_write));
>>> +
>>> +   /* top-to-bottom dependencies: RAW and WAW. */
>>> +   foreach_list(node, &instructions) {
>>> +      schedule_node *n = (schedule_node *)node;
>>> +      vec4_instruction *inst = n->inst;
>>> +
>>> +      /* read-after-write deps. */
>>> +      for (int i = 0; i < 3; i++) {
>>> +	 if (inst->src[i].file == GRF) {
>>> +	    add_dep(last_grf_write[inst->src[i].reg], n);
>>> +	 } else if (inst->src[i].file == HW_REG &&
>>> +		    (inst->src[i].fixed_hw_reg.file ==
>>> +		     BRW_GENERAL_REGISTER_FILE)) {
>>> +	    add_dep(last_fixed_grf_write, n);
>>> +	 } else if (inst->src[i].file != BAD_FILE &&
>>> +		    inst->src[i].file != IMM &&
>>> +		    inst->src[i].file != UNIFORM) {
>>> +	    assert(inst->src[i].file != MRF);
>>
>> I had to ask "What about ATTR?" here.  It turns out that
>> setup_attributes() executed earlier and lowered them all away, but it
>> might be nice to have an explicit assertion or comment so people don't
>> have to ask.
>
> Asserted.
>
>>> +      /* write-after-write deps. */
>>> +      if (inst->dst.file == GRF) {
>>> +	 add_dep(last_grf_write[inst->dst.reg], n);
>>> +	 last_grf_write[inst->dst.reg] = n;
>>> +      } else if (inst->dst.file == MRF) {
>>> +	 add_dep(last_mrf_write[inst->dst.reg], n);
>>> +	 last_mrf_write[inst->dst.reg] = n;
>>> +     } else if (inst->dst.file == HW_REG &&
>>> +		 inst->dst.fixed_hw_reg.file == BRW_GENERAL_REGISTER_FILE) {
>>> +	 last_fixed_grf_write = n;
>>> +      } else if (inst->dst.file != BAD_FILE) {
>>> +	 add_barrier_deps(n);
>>> +      }
>>> +
>>> +      if (inst->mlen > 0) {
>>> +	 for (int i = 0; i < v->implied_mrf_writes(inst); i++) {
>>
>> Eh?  Aren't you missing a line here?  Namely:
>>
>>               add_dep(last_mrf_write[inst->base_mrf + i], n);
>>
>> The FS has it, and I doubt it's superfluous.
>
> Good catch.  Fixed.
>
>>> +      /* Now that we've scheduled a new instruction, some of its
>>> +       * children can be promoted to the list of instructions ready to
>>> +       * be scheduled.  Update the children's unblocked time for this
>>> +       * DAG edge as we do so.
>>> +       */
>>> +      for (int i = 0; i < chosen->child_count; i++) {
>>> +	 schedule_node *child = chosen->children[i];
>>> +
>>> +	 child->unblocked_time = MAX2(child->unblocked_time,
>>> +				      time + chosen->child_latency[i]);
>>> +
>>> +	 child->parent_count--;
>>> +	 if (child->parent_count == 0) {
>>> +	    instructions.push_tail(child);
>>> +	 }
>>> +      }
>>> +
>>> +      /* Shared resource: the mathbox.  There's one per EU (on later
>>> +       * generations, it's even more limited pre-gen6), so if we send
>>
>> I actually don't think this is true, at least not on IVB...there's just
>> the ALU pipe and the EM pipe, and they're both available and can process
>> things...
>>
>> But I suppose that's a separate change that we ought to do in the FS as
>> well (if it's true).
>
> Cool, that's what I feel too, once we test it.

Excellent.

>>> +      }
>>> +      sched.calculate_deps();
>>> +      sched.schedule_instructions(next_block_header);
>>> +   }
>>> +
>>> +   this->live_intervals_valid = false;
>>> +}
>>>
>>
>> A final dumb observation: a lot of the scheduler boilerplate is
>> identical between the VS and FS.  This could probably be shared by:
>>
>> 1. Making schedule_node use backend_instruction pointers.
>> 2. Creating an instruction_scheduler base class and two subclasses:
>>
>>      class instruction_scheduler {
>>         virtual calculate_deps() = 0;
>>         ...everything else...
>>      }
>>
>>      Since calculate_deps() actually looks at vec4/fs instructions, it
>> should probably be pure virtual and implemented by the subclasses.
>> Every other function is the same boilerplate and could be shared.
>
> OK, it was a bunch of work, but there turned out to be more shared code
> than I thought.  I'll send out a new series once I run some more tests.

Nice, thanks for doing that.  Sorry for the extra work.