[Mesa-dev] [PATCH] i965/fs: Prefer more-critical instructions of the same age in LIFO scheduling.

Wed Oct 30 19:55:59 CET 2013

On 28 October 2013 15:34, Eric Anholt <eric at anholt.net> wrote:

> When faced with a million instructions that all became candidates at the
> same time (none of which individually reduce register pressure), the ones
> on the critical path are more likely to be the ones that will free up some
> candidates soon.
>
> shader-db:
> total instructions in shared programs: 1681070 -> 1681070 (0.00%)
> instructions in affected programs:     0 -> 0
> GAINED:                                40
> LOST:                                  74
>
> Fixes indistinguishable-from-hanging behavior in GLES3conform's
> uniform_buffer_object_max_uniform_block_size test, regressed by
> c3c9a8c85758796a26b48e484286e6b6f5a5299a.  Given that
> 93bd627d5a6c485948b94488e6cd53a06b7ebdcf was unlocked by that commit, the
> net effect on 16-wide program count is still quite positive, and I think
> this should give us more stable scheduling (less dependency on original
> instruction emit order).
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=70943
> ---
>  .../drivers/dri/i965/brw_schedule_instructions.cpp | 77
> +++++++++++++++++-----
>  1 file changed, 62 insertions(+), 15 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
> b/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
> index 06e3a8b..04c875f 100644
> --- a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
> @@ -68,6 +68,7 @@ public:
>        this->child_count = 0;
>        this->parent_count = 0;
>        this->unblocked_time = 0;
> +      this->cand_generation = 0;
>
>        /* We can't measure Gen6 timings directly but expect them to be much
>         * closer to Gen7 than Gen4.
> @@ -91,6 +92,12 @@ public:
>     int latency;
>
>     /**
> +    * Which iteration of pushing groups of children onto the candidates
> list
> +    * this node was a part of.
> +    */
> +   unsigned cand_generation;
> +
> +   /**
>      * This is the sum of the instruction's latency plus the maximum delay
> of
>      * its children, or just the issue_time if it's a leaf node.
>      */
> @@ -973,26 +980,63 @@
> fs_instruction_scheduler::choose_instruction_to_schedule()
>         * variables so that we can avoid register spilling, or get 16-wide
>         * shaders which naturally do a better job of hiding instruction
>         * latency.
> -       *
> -       * To do so, schedule our instructions in a roughly LIFO/depth-first
> -       * order: when new instructions become available as a result of
> -       * scheduling something, choose those first so that our result
> -       * hopefully is consumed quickly.
> -       *
> -       * The exception is messages that generate more than one result
> -       * register (AKA texturing).  In those cases, the LIFO search would
> -       * normally tend to choose them quickly (because scheduling the
> -       * previous message not only unblocked the children using its
> result,
> -       * but also the MRF setup for the next sampler message, which in
> turn
> -       * unblocks the next sampler message).
>         */
>        foreach_list(node, &instructions) {
>           schedule_node *n = (schedule_node *)node;
>           fs_inst *inst = (fs_inst *)n->inst;
>
> -         chosen = n;
> -         if (v->brw->gen >= 7 || inst->regs_written <= 1)
> -            break;
> +         if (!chosen) {
> +            chosen = n;
> +            continue;
> +         }
> +
> +         /* Prefer instructions that recently became available for
> scheduling.
> +          * These are the things that are most likely to (eventually)
> make a
> +          * variable dead and reduce register pressure.  Typical register
> +          * pressure estimates don't work for us because most of our
> pressure
> +          * comes from texturing, where no single instruction to schedule
> will
> +          * make a vec4 value dead.
> +          */
> +         if (n->cand_generation > chosen->cand_generation) {
> +            chosen = n;
> +            continue;
> +         } else if (n->cand_generation < chosen->cand_generation) {
> +            continue;
> +         }
> +
> +         /* On MRF-using chips, prefer non-SEND instructions.  Because we
> +          * prefer instructions that just became candidates, we'll end up
> in a
> +          * pattern of scheduling a SEND, then the MRFs for the next SEND,
> +          * then the next SEND, then the MRFs, etc., without ever
> consuming
> +          * the results of a send.
>

On the first few readings of this comment, I was confused because I didn't
realize that the sentence beginning "Because..." was describing a
counterfactual.  Maybe change this to "If we don't do this, then because we
prefer...".

>
> +          */
> +         if (v->brw->gen < 7) {
> +            fs_inst *chosen_inst = (fs_inst *)chosen->inst;
> +
> +            if (inst->regs_written <= 1 && chosen_inst->regs_written > 1)
> {
>

It would be nice to have a comment here explaining that we're using
inst->regs_written > 1 as a proxy for determining whether something is a
send instruction--this isn't exactly correct, but it's close enough because
the only SENDs that it misses are things like spills and fb_writes, and
those aren't involved in the pathological case we're trying to fix.

With those minor changes, series is:

Reviewed-by: Paul Berry <stereotype441 at gmail.com>

> +               chosen = n;
> +               continue;
> +            } else if (inst->regs_written > chosen_inst->regs_written) {
> +               continue;
> +            }
> +         }
> +
> +         /* For instructions pushed on the cands list at the same time,
> prefer
> +          * the one with the highest delay to the end of the program.
>  This is
> +          * most likely to have its values able to be consumed first
> (such as
> +          * for a large tree of lowered ubo loads, which appear reversed
> in
> +          * the instruction stream with respect to when they can be
> consumed).
> +          */
> +         if (n->delay > chosen->delay) {
> +            chosen = n;
> +            continue;
> +         } else if (n->delay < chosen->delay) {
> +            continue;
> +         }
> +
> +         /* If all other metrics are equal, we prefer the first
> instruction in
> +          * the list (program execution).
> +          */
>        }
>     }
>
> @@ -1048,6 +1092,7 @@
> instruction_scheduler::schedule_instructions(backend_instruction
> *next_block_hea
>          n->remove();
>     }
>
> +   unsigned cand_generation = 1;
>     while (!instructions.is_empty()) {
>        schedule_node *chosen = choose_instruction_to_schedule();
>
> @@ -1090,6 +1135,7 @@
> instruction_scheduler::schedule_instructions(backend_instruction
> *next_block_hea
>              bv->dump_instruction(child->inst);
>           }
>
> +         child->cand_generation = cand_generation;
>          child->parent_count--;
>          if (child->parent_count == 0) {
>              if (debug) {
> @@ -1098,6 +1144,7 @@
> instruction_scheduler::schedule_instructions(backend_instruction
> *next_block_hea
>             instructions.push_head(child);
>          }
>        }
> +      cand_generation++;
>
>        /* Shared resource: the mathbox.  There's one mathbox per EU on
> Gen6+
>         * but it's more limited pre-gen6, so if we send something off to
> it then
> --
> 1.8.4.rc3
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20131030/3506fd98/attachment.html>