[Mesa-dev] [PATCH] i965/fs: Reduce the interference between payload regs and virtual GRFs.

Tue Oct 16 19:04:48 PDT 2012

On 10/16/2012 06:05 PM, Eric Anholt wrote:
> Kenneth Graunke <kenneth at whitecape.org> writes:
>
>> On 10/15/2012 04:06 PM, Eric Anholt wrote:
>>> Improves performance of the Lightsmark penumbra shadows scene by 15.7% +/-
>>> 1.0% (n=15), by eliminating register spilling. (tested by smashing the list of
>>> scenes to have all other scenes have 0 duration -- includes additional
>>> rendering of scene description text that normally doesn't appear in that
>>> scene)
>>> ---
>>>    src/mesa/drivers/dri/i965/brw_fs.h                |    2 +
>>>    src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp |  164 ++++++++++++++++++---
>>>    2 files changed, 147 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
>>> index a71783c..ad717c9 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs.h
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
>>> @@ -235,6 +235,8 @@ public:
>>>       void assign_urb_setup();
>>>       bool assign_regs();
>>>       void assign_regs_trivial();
>>> +   void setup_payload_interference(struct ra_graph *g, int payload_reg_count,
>>> +                                   int first_payload_node);
>>>       int choose_spill_reg(struct ra_graph *g);
>>>       void spill_reg(int spill_reg);
>>>       void split_virtual_grfs();
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp b/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>>> index 7b778d6..bd9789f 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>>> @@ -163,27 +163,154 @@ brw_alloc_reg_set(struct brw_context *brw, int reg_width, int base_reg_count)
>>>    /**
>>>     * Sets up interference between thread payload registers and the virtual GRFs
>>>     * to be allocated for program temporaries.
>>> + *
>>> + * We want to be able to reallocate the payload for our virtual GRFs, notably
>>> + * because the setup coefficients for a full set of 16 FS inputs takes up 8 of
>>> + * our 128 registers.
>>> + *
>>> + * The layout of the payload registers is:
>>> + *
>>> + * 0..nr_payload_regs-1: fixed function setup (including bary coordinates).
>>> + * nr_payload_regs..nr_payload_regs+curb_read_lengh-1: uniform data
>>> + * nr_payload_regs+curb_read_lengh..first_non_payload_grf-1: setup coefficients.
>>> + *
>>> + * And we have payload_node_count nodes covering these registers in order
>>> + * (note that in 16-wide, a node is two registers).
>>>     */
>>> -static void
>>> -brw_setup_payload_interference(struct ra_graph *g,
>>> -                               int payload_reg_count,
>>> -                               int first_payload_node,
>>> -                               int reg_node_count)
>>> +void
>>> +fs_visitor::setup_payload_interference(struct ra_graph *g,
>>> +                                       int payload_node_count,
>>> +                                       int first_payload_node)
>>>    {
>>> -   for (int i = 0; i < payload_reg_count; i++) {
>>> -      /* Mark each payload reg node as being allocated to its physical register.
>>> +   int reg_width = c->dispatch_width / 8;
>>> +   int loop_depth = 0;
>>> +   int loop_end_ip = 0;
>>> +
>>> +   int payload_last_use_ip[payload_node_count];
>>> +   memset(payload_last_use_ip, 0, sizeof(payload_last_use_ip));
>>> +   int ip = 0;
>>> +   foreach_list(node, &this->instructions) {
>>> +      fs_inst *inst = (fs_inst *)node;
>>> +
>>> +      switch (inst->opcode) {
>>> +      case BRW_OPCODE_DO:
>>> +         loop_depth++;
>>> +
>>> +         /* Since payload regs are deffed only at the start of the shader
>>> +          * execution, any uses of the payload within a loop mean the live
>>> +          * interval extends to the end of the outermost loop.  Find the ip of
>>> +          * the end now.
>>> +          */
>>> +         if (loop_depth == 1) {
>>> +            int scan_depth = loop_depth;
>>> +            int scan_ip = ip;
>>> +            for (fs_inst *scan_inst = (fs_inst *)inst->next;
>>> +                 scan_depth > 0;
>>> +                 scan_inst = (fs_inst *)scan_inst->next) {
>>> +               switch (scan_inst->opcode) {
>>> +               case BRW_OPCODE_DO:
>>> +                  scan_depth++;
>>> +                  break;
>>> +               case BRW_OPCODE_WHILE:
>>> +                  scan_depth--;
>>> +                  break;
>>> +               default:
>>> +                  break;
>>> +               }
>>> +               scan_ip++;
>>> +            }
>>> +            loop_end_ip = scan_ip;
>>> +         }
>>> +         break;
>>> +      case BRW_OPCODE_WHILE:
>>> +         loop_depth--;
>>> +         break;
>>> +      default:
>>> +         break;
>>> +      }
>>
>> Wow, it's unfortunate that you have to do this.  Essentially, for each
>> instruction in a loop, you walk through all the instructions, to the end
>> of the loop.  That's big O(fail). :(
>
> Huh?  This is "for the top-level loop instruction, count to the end of
> that loop".
>
> I mean, we could keep ip in the instructions, but then you get to update
> it all over when you inst->remove().

Ugh, you're right.  I misread.

On the first DO, you walk through every instruction in the loop, then 
proceed to the next instruction (the first instruction inside the loop, 
rather than fast forwarding to the end of the block).  For some reason, 
I thought that you re-walked through the loop.  But you don't: you only 
walk it for DO instructions.  So it's not O(n^2) like I feared.

And you can't simply skip the instructions within the loop like I 
foolishly suggested.  You need to process them to find register uses.

I then failed to read the code in a second way, and had to work through 
the following example to make sure it wasn't broken:

  1. DO
  2.    inst2
  3.    inst3
  4.    DO
  5.       inst6
  6.       inst7
  7.    WHILE
  8.    inst8
  9.    inst9
10.    inst10
11. WHILE

Line 1 triggers a walk through lines 1-11, setting loop_end_ip and 
use_ip to line 11.  Line 2 and 3 just get processed; use_ip is 11.

I thought line 4 triggered a walk through the inner loop, but it 
doesn't, since loop_depth > 1.  No walking occurs at all...and 
loop_end_ip remains 11.  Which is correct.  I was afraid it would get 
set to 7 (the end of the inner loop), at which point use_ip for lines 
8-10 would be bogus.

So I believe your patch is correct.  I just can't read code today.

Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>

That said, I think you could structure it more clearly.  Perhaps 
something like:

/**
  * Given a DO instruction (and ip count), return the ip for the
  * matching WHILE instruction.
  */
static int
find_loop_end(fs_inst *do_inst, int do_ip)
{
    int loop_depth = 1;
    int ip = do_ip;

    for (fs_inst *inst = do_inst->next;
         loop_depth > 0;
         inst = (fs_inst *) inst->next) {
       ip++;
       if (inst->opcode == BRW_OPCODE_DO)
          loop_depth++;
       else if (inst->opcode == BRW_OPCODE_WHILE)
          loop_depth--;
    }
    return ip;
}

This is short and simple enough that it's easy to read.

And then in your function:

    int ip = 0;
    int outermost_loop_end_ip = 0;
    foreach_list(node, &this->instructions) {
       fs_inst *inst = (fs_inst *) node;

      /* Variables defined outside of a loop are alive for the whole
       * duration of the loop.  Since payload registers are always
       * defined at the very top of the shader, any use within a loop
       * means that the live interval extends to the end of the outermost
       * enclosing loop.  If this the start of a top-level loop, find
       * the IP for the matching WHILE now.
       */
      if (inst->opcode == BRW_OPCODE_DO && ip >= outermost_loop_end_ip)
         outermost_loop_end_ip = find_loop_end(inst, ip);

      int use_ip = MAX2(ip, outermost_loop_end_ip);

      //...the rest of your code...
   }

If you like that (and it really does work), feel free to use my 
approach.  Since I'm on vacation the rest of the week, you shouldn't 
wait for me to review a respin.  Just go ahead and push it.

Or, if you don't like it, you can just push your version.