[Mesa-dev] [PATCH 2/2] i965/vs: Add instruction scheduling.
Kenneth Graunke
kenneth at whitecape.org
Sun Apr 28 01:54:22 PDT 2013
On 04/23/2013 04:56 PM, Eric Anholt wrote:
> While this doesn't have the detail that the FS scheduler does, and is
> ignorant of dependency control, it's still good for a 0.60% +/- 0.15%
> performance improvement on GLBenchmark 2.7 (n=45/47, outliers removed)
> ---
> src/mesa/drivers/dri/i965/Makefile.sources | 1 +
> src/mesa/drivers/dri/i965/brw_vec4.cpp | 9 +
> src/mesa/drivers/dri/i965/brw_vec4.h | 1 +
> .../drivers/dri/i965/brw_vec4_reg_allocate.cpp | 4 +
> .../dri/i965/brw_vec4_schedule_instructions.cpp | 484 ++++++++++++++++++++
> 5 files changed, 499 insertions(+)
> create mode 100644 src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
>
> diff --git a/src/mesa/drivers/dri/i965/Makefile.sources b/src/mesa/drivers/dri/i965/Makefile.sources
> index be8d630..a5231ee 100644
> --- a/src/mesa/drivers/dri/i965/Makefile.sources
> +++ b/src/mesa/drivers/dri/i965/Makefile.sources
> @@ -86,6 +86,7 @@ i965_FILES = \
> brw_vec4.cpp \
> brw_vec4_copy_propagation.cpp \
> brw_vec4_emit.cpp \
> + brw_vec4_schedule_instructions.cpp \
> brw_vec4_live_variables.cpp \
> brw_vec4_reg_allocate.cpp \
> brw_vec4_visitor.cpp \
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> index 0afff6f..120a4e2 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> @@ -285,6 +285,13 @@ vec4_visitor::implied_mrf_writes(vec4_instruction *inst)
> return 3;
> case SHADER_OPCODE_SHADER_TIME_ADD:
> return 0;
> + case SHADER_OPCODE_TEX:
> + case SHADER_OPCODE_TXL:
> + case SHADER_OPCODE_TXD:
> + case SHADER_OPCODE_TXF:
> + case SHADER_OPCODE_TXF_MS:
> + case SHADER_OPCODE_TXS:
> + return (inst->texture_offset || inst->header_present) ? 1 : 0;
inst->header_present is actually true if inst->texture_offset != 0,
though I admit that's only obvious from the vec4_visitor side (and not
obvious when reading vec4_generator).
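In other words, I believe this could just be (untested sketch):

   return inst->header_present ? 1 : 0;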
Either way is fine.
> default:
> assert(!"not reached");
> return inst->mlen;
> @@ -1499,6 +1506,8 @@ vec4_visitor::run()
>
> opt_set_dependency_control();
>
> + opt_schedule_instructions();
> +
Setting the "don't check dependencies" flags and then reordering all the
instructions makes me nervous. Maybe it's safe, but I'd rather not have
to try and justify it. Can we switch the order of these passes?
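i.e. something like this in run(), just swapping the two calls:

   opt_schedule_instructions();

   opt_set_dependency_control();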
> /* If any state parameters were appended, then ParameterValues could have
> * been realloced, in which case the driver uniform storage set up by
> * _mesa_associate_uniform_storage() would point to freed memory. Make
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.h b/src/mesa/drivers/dri/i965/brw_vec4.h
> index 697ab86..c9f1a4e 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4.h
> +++ b/src/mesa/drivers/dri/i965/brw_vec4.h
> @@ -340,6 +340,7 @@ public:
> bool opt_algebraic();
> bool opt_register_coalesce();
> void opt_set_dependency_control();
> + void opt_schedule_instructions();
>
> bool can_do_source_mods(vec4_instruction *inst);
>
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
> index ac3d401..7149d46 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
> @@ -102,6 +102,8 @@ brw_alloc_reg_set_for_classes(struct brw_context *brw,
> int class_count,
> int base_reg_count)
> {
> + struct intel_context *intel = &brw->intel;
> +
> /* Compute the total number of registers across all classes. */
> int ra_reg_count = 0;
> for (int i = 0; i < class_count; i++) {
> @@ -112,6 +114,8 @@ brw_alloc_reg_set_for_classes(struct brw_context *brw,
> brw->vs.ra_reg_to_grf = ralloc_array(brw, uint8_t, ra_reg_count);
> ralloc_free(brw->vs.regs);
> brw->vs.regs = ra_alloc_reg_set(brw, ra_reg_count);
> + if (intel->gen >= 6)
> + ra_set_allocate_round_robin(brw->vs.regs);
Isn't this an unrelated change? Both this and scheduling could impact
performance.
> ralloc_free(brw->vs.classes);
> brw->vs.classes = ralloc_array(brw, int, class_count + 1);
>
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp b/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
> new file mode 100644
> index 0000000..f1e997a
> --- /dev/null
> +++ b/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
> @@ -0,0 +1,484 @@
This new file has tabs. Please replace them with spaces.
> +/*
> + * Copyright © 2010 Intel Corporation
Guessing you want 2013 here.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + * Authors:
> + * Eric Anholt <eric at anholt.net>
> + *
> + */
> +
> +extern "C" {
> +#include "main/macros.h"
> +}
> +#include "brw_vec4.h"
> +#include "glsl/glsl_types.h"
> +#include "glsl/ir_optimization.h"
> +#include "glsl/ir_print_visitor.h"
> +
> +/** @file brw_vec4_schedule_instructions.cpp
> + *
> + * List scheduling of vec4 instructions.
> + *
> + * The basic model of the list scheduler is to take a basic block,
> + * compute a DAG of the dependencies (RAW ordering with latency, WAW
> + * ordering, WAR ordering), and make a list of the DAG heads.
> + * Heuristically pick a DAG head, then put all the children that are
> + * now DAG heads into the list of things to schedule.
> + *
> + * The heuristic is the important part. We're trying to be cheap,
> + * since actually computing the optimal scheduling is NP complete.
> + * What we do is track a "current clock". When we schedule a node, we
> + * update the earliest-unblocked clock time of its children, and
> + * increment the clock. Then, when trying to schedule, we just pick
> + * the earliest-unblocked instruction to schedule.
> + *
> + * Note that often there will be many things which could execute
> + * immediately, and there are a range of heuristic options to choose
> + * from in picking among those.
> + */
> +using namespace brw;
> +
> +namespace {
> +
> +class schedule_node : public exec_node
> +{
> +public:
> + schedule_node(vec4_instruction *inst)
> + {
> + this->inst = inst;
> + this->child_array_size = 0;
> + this->children = NULL;
> + this->child_latency = NULL;
> + this->child_count = 0;
> + this->parent_count = 0;
> + this->unblocked_time = 0;
> +
> + if (inst->is_tex() ||
> + inst->opcode == VS_OPCODE_PULL_CONSTANT_LOAD) {
> + /* Testing of the latency of texturing instructions in the FS on IVB
> + * showed a minimum of 130 cycles in the cache-totally-host case, and
s/host/hot
> + * a minimum of ~630 cycles in the L1-L3 cache cold case. Pick some
> + * number in between.
> + *
> + * Since we use sampler LD for our pull constants, they get the same
> + * treatment.
> + */
> + this->latency = 300;
> + } else {
> + /* On gen7+, latency of floating point math and alu instructions is
> + * 14 extra cycles.
> + */
> + this->latency = 14;
> + }
> + }
> +
> + vec4_instruction *inst;
> + schedule_node **children;
> + int *child_latency;
> + int child_count;
> + int parent_count;
> + int child_array_size;
> + int unblocked_time;
> + int latency;
> +};
> +
> +class instruction_scheduler {
> +public:
> + instruction_scheduler(vec4_visitor *v, void *mem_ctx, int grf_count)
> + {
> + this->v = v;
> + this->mem_ctx = ralloc_context(mem_ctx);
> + this->grf_count = grf_count;
> + this->instructions.make_empty();
Calling make_empty() is unnecessary. The default exec_list constructor
already does that for you.
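i.e. the constructor could just be:

   instruction_scheduler(vec4_visitor *v, void *mem_ctx, int grf_count)
   {
      this->v = v;
      this->mem_ctx = ralloc_context(mem_ctx);
      this->grf_count = grf_count;
      this->instructions_to_schedule = 0;
   }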
> + this->instructions_to_schedule = 0;
> + }
> +
> + ~instruction_scheduler()
> + {
> + ralloc_free(this->mem_ctx);
> + }
> + void add_barrier_deps(schedule_node *n);
> + void add_dep(schedule_node *before, schedule_node *after, int latency);
> + void add_dep(schedule_node *before, schedule_node *after);
> +
> + void add_inst(vec4_instruction *inst);
> + void calculate_deps();
> + void schedule_instructions(vec4_instruction *next_block_header);
> +
> + void *mem_ctx;
> +
> + int instructions_to_schedule;
> + int grf_count;
> + exec_list instructions;
> + vec4_visitor *v;
> +};
> +
> +void
> +instruction_scheduler::add_inst(vec4_instruction *inst)
> +{
> + schedule_node *n = new(mem_ctx) schedule_node(inst);
> +
> + assert(!inst->is_head_sentinel());
> + assert(!inst->is_tail_sentinel());
> +
> + this->instructions_to_schedule++;
> +
> + inst->remove();
> + instructions.push_tail(n);
> +}
> +
> +/**
> + * Add a dependency between two instruction nodes.
> + *
> + * The @after node will be scheduled after @before. We will try to
> + * schedule it @latency cycles after @before, but no guarantees there.
> + */
> +void
> +instruction_scheduler::add_dep(schedule_node *before, schedule_node *after,
> + int latency)
> +{
> + if (!before || !after)
> + return;
> +
> + assert(before != after);
> +
> + for (int i = 0; i < before->child_count; i++) {
> + if (before->children[i] == after) {
> + before->child_latency[i] = MAX2(before->child_latency[i], latency);
> + return;
> + }
> + }
> +
> + if (before->child_array_size <= before->child_count) {
> + if (before->child_array_size < 16)
> + before->child_array_size = 16;
> + else
> + before->child_array_size *= 2;
> +
> + before->children = reralloc(mem_ctx, before->children,
> + schedule_node *,
> + before->child_array_size);
> + before->child_latency = reralloc(mem_ctx, before->child_latency,
> + int, before->child_array_size);
> + }
> +
> + before->children[before->child_count] = after;
> + before->child_latency[before->child_count] = latency;
> + before->child_count++;
> + after->parent_count++;
> +}
> +
> +void
> +instruction_scheduler::add_dep(schedule_node *before, schedule_node *after)
> +{
> + if (!before)
> + return;
> +
> + add_dep(before, after, before->latency);
> +}
> +
> +/**
> + * Sometimes we really want this node to execute after everything that
> + * was before it and before everything that followed it. This adds
> + * the deps to do so.
> + */
> +void
> +instruction_scheduler::add_barrier_deps(schedule_node *n)
> +{
> + schedule_node *prev = (schedule_node *)n->prev;
> + schedule_node *next = (schedule_node *)n->next;
> +
> + if (prev) {
> + while (!prev->is_head_sentinel()) {
> + add_dep(prev, n, 0);
> + prev = (schedule_node *)prev->prev;
> + }
> + }
> +
> + if (next) {
> + while (!next->is_tail_sentinel()) {
> + add_dep(n, next, 0);
> + next = (schedule_node *)next->next;
> + }
> + }
> +}
> +
> +void
> +instruction_scheduler::calculate_deps()
> +{
> + schedule_node *last_grf_write[grf_count];
> + schedule_node *last_mrf_write[BRW_MAX_MRF];
> + schedule_node *last_conditional_mod = NULL;
> + /* Fixed HW registers are assumed to be separate from the virtual
> + * GRFs, so they can be tracked separately. We don't really write
> + * to fixed GRFs much, so don't bother tracking them on a more
> + * granular level.
> + */
> + schedule_node *last_fixed_grf_write = NULL;
> +
> + /* The last instruction always needs to still be the last
> + * instruction. Either it's flow control (IF, ELSE, ENDIF, DO,
> + * WHILE) and scheduling other things after it would disturb the
> + * basic block, or it's FB_WRITE and we should do a better job at
It better not be a FB_WRITE :) "URB_WRITE" would be fine.
> + * dead code elimination anyway.
> + */
> + schedule_node *last = (schedule_node *)instructions.get_tail();
> + add_barrier_deps(last);
> +
> + memset(last_grf_write, 0, sizeof(last_grf_write));
> + memset(last_mrf_write, 0, sizeof(last_mrf_write));
> +
> + /* top-to-bottom dependencies: RAW and WAW. */
> + foreach_list(node, &instructions) {
> + schedule_node *n = (schedule_node *)node;
> + vec4_instruction *inst = n->inst;
> +
> + /* read-after-write deps. */
> + for (int i = 0; i < 3; i++) {
> + if (inst->src[i].file == GRF) {
> + add_dep(last_grf_write[inst->src[i].reg], n);
> + } else if (inst->src[i].file == HW_REG &&
> + (inst->src[i].fixed_hw_reg.file ==
> + BRW_GENERAL_REGISTER_FILE)) {
> + add_dep(last_fixed_grf_write, n);
> + } else if (inst->src[i].file != BAD_FILE &&
> + inst->src[i].file != IMM &&
> + inst->src[i].file != UNIFORM) {
> + assert(inst->src[i].file != MRF);
I had to ask "What about ATTR?" here. It turns out that
setup_attributes() executed earlier and lowered them all away, but it
might be nice to have an explicit assertion or comment so people don't
have to ask.
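e.g. just adding, next to the existing MRF assert (untested, but I believe
it holds once setup_attributes() has run):

         assert(inst->src[i].file != ATTR);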
> + add_barrier_deps(n);
> + }
> + }
> +
> + for (int i = 0; i < inst->mlen; i++) {
> + /* It looks like the MRF regs are released in the send
> + * instruction once it's sent, not when the result comes
> + * back.
> + */
> + add_dep(last_mrf_write[inst->base_mrf + i], n);
> + }
> +
> + if (inst->predicate) {
> + assert(last_conditional_mod);
> + add_dep(last_conditional_mod, n);
> + }
> +
> + /* write-after-write deps. */
> + if (inst->dst.file == GRF) {
> + add_dep(last_grf_write[inst->dst.reg], n);
> + last_grf_write[inst->dst.reg] = n;
> + } else if (inst->dst.file == MRF) {
> + add_dep(last_mrf_write[inst->dst.reg], n);
> + last_mrf_write[inst->dst.reg] = n;
> + } else if (inst->dst.file == HW_REG &&
> + inst->dst.fixed_hw_reg.file == BRW_GENERAL_REGISTER_FILE) {
> + last_fixed_grf_write = n;
> + } else if (inst->dst.file != BAD_FILE) {
> + add_barrier_deps(n);
> + }
> +
> + if (inst->mlen > 0) {
> + for (int i = 0; i < v->implied_mrf_writes(inst); i++) {
Eh? Aren't you missing a line here? Namely:
add_dep(last_mrf_write[inst->base_mrf + i], n);
The FS has it, and I doubt it's superfluous.
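i.e. I'd expect this block to read:

      if (inst->mlen > 0) {
         for (int i = 0; i < v->implied_mrf_writes(inst); i++) {
            add_dep(last_mrf_write[inst->base_mrf + i], n);
            last_mrf_write[inst->base_mrf + i] = n;
         }
      }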
> + last_mrf_write[inst->base_mrf + i] = n;
> + }
> + }
> +
> + if (inst->conditional_mod) {
> + add_dep(last_conditional_mod, n, 0);
> + last_conditional_mod = n;
> + }
> + }
> +
> + /* bottom-to-top dependencies: WAR */
> + memset(last_grf_write, 0, sizeof(last_grf_write));
> + memset(last_mrf_write, 0, sizeof(last_mrf_write));
> + last_conditional_mod = NULL;
> + last_fixed_grf_write = NULL;
> +
> + exec_node *node;
> + exec_node *prev;
> + for (node = instructions.get_tail(), prev = node->prev;
> + !node->is_head_sentinel();
> + node = prev, prev = node->prev) {
> + schedule_node *n = (schedule_node *)node;
> + vec4_instruction *inst = n->inst;
> +
> + /* write-after-read deps. */
> + for (int i = 0; i < 3; i++) {
> + if (inst->src[i].file == GRF) {
> + add_dep(n, last_grf_write[inst->src[i].reg]);
> + } else if (inst->src[i].file == HW_REG &&
> + (inst->src[i].fixed_hw_reg.file ==
> + BRW_GENERAL_REGISTER_FILE)) {
> + add_dep(n, last_fixed_grf_write);
> + } else if (inst->src[i].file != BAD_FILE &&
> + inst->src[i].file != IMM &&
> + inst->src[i].file != UNIFORM) {
!= ATTR assertion would be nice here too.
> + assert(inst->src[i].file != MRF);
> + add_barrier_deps(n);
> + }
> + }
> +
> + for (int i = 0; i < inst->mlen; i++) {
> + /* It looks like the MRF regs are released in the send
> + * instruction once it's sent, not when the result comes
> + * back.
> + */
> + add_dep(n, last_mrf_write[inst->base_mrf + i], 2);
> + }
> +
> + if (inst->predicate) {
> + add_dep(n, last_conditional_mod);
> + }
> +
> + /* Update the things this instruction wrote, so earlier reads
> + * can mark this as WAR dependency.
> + */
> + if (inst->dst.file == GRF) {
> + last_grf_write[inst->dst.reg] = n;
> + } else if (inst->dst.file == MRF) {
> + last_mrf_write[inst->dst.reg] = n;
> + } else if (inst->dst.file == HW_REG &&
> + inst->dst.fixed_hw_reg.file == BRW_GENERAL_REGISTER_FILE) {
> + last_fixed_grf_write = n;
> + } else if (inst->dst.file != BAD_FILE) {
> + add_barrier_deps(n);
> + }
> +
> + if (inst->mlen > 0) {
> + for (int i = 0; i < v->implied_mrf_writes(inst); i++) {
> + last_mrf_write[inst->base_mrf + i] = n;
> + }
> + }
> +
> + if (inst->conditional_mod) {
> + last_conditional_mod = n;
> + }
> + }
> +}
> +
> +void
> +instruction_scheduler::schedule_instructions(vec4_instruction *next_block_header)
> +{
> + int time = 0;
> +
> + /* Remove non-DAG heads from the list. */
> + foreach_list_safe(node, &instructions) {
> + schedule_node *n = (schedule_node *)node;
> + if (n->parent_count != 0)
> + n->remove();
> + }
> +
> + while (!instructions.is_empty()) {
> + schedule_node *chosen = NULL;
> + int chosen_time = 0;
> +
> + foreach_list(node, &instructions) {
> + schedule_node *n = (schedule_node *)node;
> +
> + if (!chosen || n->unblocked_time < chosen_time) {
> + chosen = n;
> + chosen_time = n->unblocked_time;
> + }
> + }
> +
> + /* Schedule this instruction. */
> + assert(chosen);
> + chosen->remove();
> + next_block_header->insert_before(chosen->inst);
> + instructions_to_schedule--;
> +
> + /* Bump the clock. If we expected a delay for scheduling, then bump the
> + * clock to reflect that. VS instructions take 2 cycles, since the
> + * SIMD4 units the hw is composed of get loaded twice, once for each
> + * half of a vec8 register.
> + */
> + time = MAX2(time + 2, chosen_time);
> +
> + /* Now that we've scheduled a new instruction, some of its
> + * children can be promoted to the list of instructions ready to
> + * be scheduled. Update the children's unblocked time for this
> + * DAG edge as we do so.
> + */
> + for (int i = 0; i < chosen->child_count; i++) {
> + schedule_node *child = chosen->children[i];
> +
> + child->unblocked_time = MAX2(child->unblocked_time,
> + time + chosen->child_latency[i]);
> +
> + child->parent_count--;
> + if (child->parent_count == 0) {
> + instructions.push_tail(child);
> + }
> + }
> +
> + /* Shared resource: the mathbox. There's one per EU (on later
> + * generations, it's even more limited pre-gen6), so if we send
I actually don't think this is true, at least not on IVB: there's just
the ALU pipe and the EM pipe, and they're both available and can process
things in parallel.
But I suppose that's a separate change that we ought to make in the FS as
well (if it's true).
> + * something off to it then the next math isn't going to make
> + * progress until the first is done.
> + */
> + if (chosen->inst->is_math()) {
> + foreach_list(node, &instructions) {
> + schedule_node *n = (schedule_node *)node;
> +
> + if (n->inst->is_math())
> + n->unblocked_time = MAX2(n->unblocked_time,
> + time + chosen->latency);
> + }
> + }
> + }
> +
> + if (unlikely(INTEL_DEBUG & DEBUG_VS))
> + printf("vs esimated execution time: %d\n", time);
Typo: "estimated" not esimated.
> +
> + assert(instructions_to_schedule == 0);
> +}
> +
> +} /* anonymous namespace */
> +
> +void
> +vec4_visitor::opt_schedule_instructions()
> +{
> + vec4_instruction *next_block_header = (vec4_instruction *)instructions.head;
> + instruction_scheduler sched(this, mem_ctx, prog_data->total_grf);
> +
> + while (!next_block_header->is_tail_sentinel()) {
> + /* Add things to be scheduled until we get to a new BB. */
> + while (!next_block_header->is_tail_sentinel()) {
> + vec4_instruction *inst = next_block_header;
> + next_block_header = (vec4_instruction *)next_block_header->next;
> +
> + sched.add_inst(inst);
> + if (inst->opcode == BRW_OPCODE_IF ||
> + inst->opcode == BRW_OPCODE_ELSE ||
> + inst->opcode == BRW_OPCODE_ENDIF ||
> + inst->opcode == BRW_OPCODE_DO ||
> + inst->opcode == BRW_OPCODE_WHILE ||
> + inst->opcode == BRW_OPCODE_BREAK ||
> + inst->opcode == BRW_OPCODE_CONTINUE) {
> + break;
> + }
Would be nicer with the helper:

   if (inst->is_control_flow())
      break;

You'll need the patch (just posted to the list) to make it available:
i965: Move is_math/is_tex/is_control_flow() to backend_instruction.
> + }
> + sched.calculate_deps();
> + sched.schedule_instructions(next_block_header);
> + }
> +
> + this->live_intervals_valid = false;
> +}
>
A final dumb observation: a lot of the scheduler boilerplate is
identical between the VS and FS. This could probably be shared by:
1. Making schedule_node use backend_instruction pointers.
2. Creating an instruction_scheduler base class and two subclasses:
   class instruction_scheduler {
      virtual void calculate_deps() = 0;
      /* ...everything else... */
   };
Since calculate_deps() actually looks at vec4/fs instructions, it
should probably be pure virtual and implemented by the subclasses.
Every other function is the same boilerplate and could be shared.
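Roughly what I have in mind, as an untested sketch (the backend_visitor
parameter is just a stand-in for whatever shared visitor base the two
schedulers end up taking):

   class instruction_scheduler {
   public:
      instruction_scheduler(backend_visitor *v, void *mem_ctx, int grf_count);
      virtual ~instruction_scheduler();

      /* Shared boilerplate: node creation, dependency edges, and the
       * pick-the-earliest-unblocked scheduling loop.
       */
      void add_inst(backend_instruction *inst);
      void add_dep(schedule_node *before, schedule_node *after, int latency);
      void add_barrier_deps(schedule_node *n);
      void schedule_instructions(backend_instruction *next_block_header);

      /* The only part that inspects fs_inst vs. vec4_instruction details,
       * so each backend implements its own.
       */
      virtual void calculate_deps() = 0;
   };

   class fs_instruction_scheduler : public instruction_scheduler {
      virtual void calculate_deps();
   };

   class vec4_instruction_scheduler : public instruction_scheduler {
      virtual void calculate_deps();
   };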