[Mesa-dev] [PATCH 2/2] i965/vs: Add instruction scheduling.
Kenneth Graunke
kenneth at whitecape.org
Mon Apr 29 20:07:42 PDT 2013
On 04/29/2013 06:20 PM, Eric Anholt wrote:
> Kenneth Graunke <kenneth at whitecape.org> writes:
>
>> On 04/23/2013 04:56 PM, Eric Anholt wrote:
>>> While this doesn't have the detail that the FS scheduler does, and is
>>> ignorant of dependency control, it's still good for a 0.60% +/- 0.15%
>>> performance improvement on GLBenchmark 2.7 (n=45/47, outliers removed)
>>> ---
>>> src/mesa/drivers/dri/i965/Makefile.sources | 1 +
>>> src/mesa/drivers/dri/i965/brw_vec4.cpp | 9 +
>>> src/mesa/drivers/dri/i965/brw_vec4.h | 1 +
>>> .../drivers/dri/i965/brw_vec4_reg_allocate.cpp | 4 +
>>> .../dri/i965/brw_vec4_schedule_instructions.cpp | 484 ++++++++++++++++++++
>>> 5 files changed, 499 insertions(+)
>>> create mode 100644 src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/Makefile.sources b/src/mesa/drivers/dri/i965/Makefile.sources
>>> index be8d630..a5231ee 100644
>>> --- a/src/mesa/drivers/dri/i965/Makefile.sources
>>> +++ b/src/mesa/drivers/dri/i965/Makefile.sources
>>> @@ -86,6 +86,7 @@ i965_FILES = \
>>> brw_vec4.cpp \
>>> brw_vec4_copy_propagation.cpp \
>>> brw_vec4_emit.cpp \
>>> + brw_vec4_schedule_instructions.cpp \
>>> brw_vec4_live_variables.cpp \
>>> brw_vec4_reg_allocate.cpp \
>>> brw_vec4_visitor.cpp \
>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> index 0afff6f..120a4e2 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> @@ -285,6 +285,13 @@ vec4_visitor::implied_mrf_writes(vec4_instruction *inst)
>>> return 3;
>>> case SHADER_OPCODE_SHADER_TIME_ADD:
>>> return 0;
>>> + case SHADER_OPCODE_TEX:
>>> + case SHADER_OPCODE_TXL:
>>> + case SHADER_OPCODE_TXD:
>>> + case SHADER_OPCODE_TXF:
>>> + case SHADER_OPCODE_TXF_MS:
>>> + case SHADER_OPCODE_TXS:
>>> + return (inst->texture_offset || inst->header_present) ? 1 : 0;
>>
>> inst->header_present is actually true if inst->texture_offset != 0,
>> though I admit that's only obvious from the vec4_visitor side (and not
>> obvious when reading vec4_generator).
>>
>> Either way is fine.
>
> I think it'll make more sense to me if I just write inst->header_present
> then. Sound good?
Yep, that sounds best.
>>> default:
>>> assert(!"not reached");
>>> return inst->mlen;
>>> @@ -1499,6 +1506,8 @@ vec4_visitor::run()
>>>
>>> opt_set_dependency_control();
>>>
>>> + opt_schedule_instructions();
>>> +
>>
>> Setting the "don't check dependencies" flags and then reordering all the
>> instructions makes me nervous. Maybe it's safe, but I'd rather not have
>> to try and justify it. Can we switch the order of these passes?
>
> It's safe at the moment, because they're all considered to be WAW deps
> of each other.
>
> I had been thinking that we wanted the scheduler to (eventually) know
> about depctrl so that it could know that you can do the series of
> depctrled dp4s right next to each other with no depdendencies. But in
> order to build the tracking for that, you'd be tracking the writes on a
> channel basis plus maintaining the depctrl tracking, and if you do that
> you could more easily just schedule them as if there's no dependency and
> know that the depctrl pass immediately afterwards will set the flags as
> needed. So I think you're right about the desired ordering.
Yeah, that makes sense to me - we just want per-channel scheduling.
opt_set_dependency_control() basically makes that scheduling work out.
>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>>> index ac3d401..7149d46 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>>> @@ -102,6 +102,8 @@ brw_alloc_reg_set_for_classes(struct brw_context *brw,
>>> int class_count,
>>> int base_reg_count)
>>> {
>>> + struct intel_context *intel = &brw->intel;
>>> +
>>> /* Compute the total number of registers across all classes. */
>>> int ra_reg_count = 0;
>>> for (int i = 0; i < class_count; i++) {
>>> @@ -112,6 +114,8 @@ brw_alloc_reg_set_for_classes(struct brw_context *brw,
>>> brw->vs.ra_reg_to_grf = ralloc_array(brw, uint8_t, ra_reg_count);
>>> ralloc_free(brw->vs.regs);
>>> brw->vs.regs = ra_alloc_reg_set(brw, ra_reg_count);
>>> + if (intel->gen >= 6)
>>> + ra_set_allocate_round_robin(brw->vs.regs);
>>
>> Isn't this an unrelated change? Both this and scheduling could impact
>> performance.
>
> Yeah, I should separate them out.
>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp b/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
>>> new file mode 100644
>>> index 0000000..f1e997a
>>> --- /dev/null
>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
>>> @@ -0,0 +1,484 @@
>>
>> This new file has tabs. Please replace them with spaces.
>
> Thanks.
>
>>> + * dead code elimination anyway.
>>> + */
>>> + schedule_node *last = (schedule_node *)instructions.get_tail();
>>> + add_barrier_deps(last);
>>> +
>>> + memset(last_grf_write, 0, sizeof(last_grf_write));
>>> + memset(last_mrf_write, 0, sizeof(last_mrf_write));
>>> +
>>> + /* top-to-bottom dependencies: RAW and WAW. */
>>> + foreach_list(node, &instructions) {
>>> + schedule_node *n = (schedule_node *)node;
>>> + vec4_instruction *inst = n->inst;
>>> +
>>> + /* read-after-write deps. */
>>> + for (int i = 0; i < 3; i++) {
>>> + if (inst->src[i].file == GRF) {
>>> + add_dep(last_grf_write[inst->src[i].reg], n);
>>> + } else if (inst->src[i].file == HW_REG &&
>>> + (inst->src[i].fixed_hw_reg.file ==
>>> + BRW_GENERAL_REGISTER_FILE)) {
>>> + add_dep(last_fixed_grf_write, n);
>>> + } else if (inst->src[i].file != BAD_FILE &&
>>> + inst->src[i].file != IMM &&
>>> + inst->src[i].file != UNIFORM) {
>>> + assert(inst->src[i].file != MRF);
>>
>> I had to ask "What about ATTR?" here. It turns out that
>> setup_attributes() executed earlier and lowered them all away, but it
>> might be nice to have an explicit assertion or comment so people don't
>> have to ask.
>
> Asserted.
>
>>> + /* write-after-write deps. */
>>> + if (inst->dst.file == GRF) {
>>> + add_dep(last_grf_write[inst->dst.reg], n);
>>> + last_grf_write[inst->dst.reg] = n;
>>> + } else if (inst->dst.file == MRF) {
>>> + add_dep(last_mrf_write[inst->dst.reg], n);
>>> + last_mrf_write[inst->dst.reg] = n;
>>> + } else if (inst->dst.file == HW_REG &&
>>> + inst->dst.fixed_hw_reg.file == BRW_GENERAL_REGISTER_FILE) {
>>> + last_fixed_grf_write = n;
>>> + } else if (inst->dst.file != BAD_FILE) {
>>> + add_barrier_deps(n);
>>> + }
>>> +
>>> + if (inst->mlen > 0) {
>>> + for (int i = 0; i < v->implied_mrf_writes(inst); i++) {
>>
>> Eh? Aren't you missing a line here? Namely:
>>
>> add_dep(last_mrf_write[inst->base_mrf + i], n);
>>
>> The FS has it, and I doubt it's superfluous.
>
> Good catch. Fixed.
>
>>> + /* Now that we've scheduled a new instruction, some of its
>>> + * children can be promoted to the list of instructions ready to
>>> + * be scheduled. Update the children's unblocked time for this
>>> + * DAG edge as we do so.
>>> + */
>>> + for (int i = 0; i < chosen->child_count; i++) {
>>> + schedule_node *child = chosen->children[i];
>>> +
>>> + child->unblocked_time = MAX2(child->unblocked_time,
>>> + time + chosen->child_latency[i]);
>>> +
>>> + child->parent_count--;
>>> + if (child->parent_count == 0) {
>>> + instructions.push_tail(child);
>>> + }
>>> + }
>>> +
>>> + /* Shared resource: the mathbox. There's one per EU (on later
>>> + * generations, it's even more limited pre-gen6), so if we send
>>
>> I actually don't think this is true, at least not on IVB...there's just
>> the ALU pipe and the EM pipe, and they're both available and can process
>> things...
>>
>> But I suppose that's a separate change that we ought to do in the FS as
>> well (if it's true).
>
> Cool, that's what I feel too, once we test it.
Excellent.
>>> + }
>>> + sched.calculate_deps();
>>> + sched.schedule_instructions(next_block_header);
>>> + }
>>> +
>>> + this->live_intervals_valid = false;
>>> +}
>>>
>>
>> A final dumb observation: a lot of the scheduler boilerplate is
>> identical between the VS and FS. This could probably be shared by:
>>
>> 1. Making schedule_node use backend_instruction pointers.
>> 2. Creating an instruction_scheduler base class and two subclasses:
>>
>> class instruction_scheduler {
>> virtual calculate_deps() = 0;
>> ...everything else...
>> }
>>
>> Since calculate_deps() actually looks at vec4/fs instructions, it
>> should probably be pure virtual and implemented by the subclasses.
>> Every other function is the same boilerplate and could be shared.
>
> OK, it was a bunch of work, but there turned out to be more shared code
> than I thought. I'll send out a new series once I run some more tests.
Nice, thanks for doing that. Sorry for the extra work.
More information about the mesa-dev
mailing list