[Mesa-dev] [PATCH 41/41] SQUASH: i965/fs: Force a high register for the final FB write

Sat Sep 20 15:45:39 PDT 2014

On Sat, Sep 20, 2014 at 5:58 PM, Jason Ekstrand <jason at jlekstrand.net> wrote:
>
> On Sep 20, 2014 2:52 PM, "Connor Abbott" <cwabbott0 at gmail.com> wrote:
>>
>> On Sat, Sep 20, 2014 at 4:56 PM, Jason Ekstrand <jason at jlekstrand.net>
>> wrote:
>> >
>> >
>> > On Sat, Sep 20, 2014 at 12:44 PM, Connor Abbott <cwabbott0 at gmail.com>
>> > wrote:
>> >>
>> >> On Sat, Sep 20, 2014 at 1:23 PM, Jason Ekstrand <jason at jlekstrand.net>
>> >> wrote:
>> >> > ---
>> >> >  src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp | 24
>> >> > ++++++++++++++++++++++-
>> >> >  src/mesa/drivers/dri/i965/intel_screen.h          |  5 +++++
>> >> >  2 files changed, 28 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>> >> > b/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>> >> > index b42f1e5..41e8855 100644
>> >> > --- a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>> >> > +++ b/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>> >> > @@ -116,8 +116,10 @@ brw_alloc_reg_set(struct intel_screen *screen,
>> >> > int
>> >> > reg_width)
>> >> >     /* Compute the total number of registers across all classes. */
>> >> >     int ra_reg_count = 0;
>> >> >     for (int i = 0; i < class_count; i++) {
>> >> > +      screen->wm_reg_sets[index].class_first_reg[i] = ra_reg_count;
>> >> >        ra_reg_count += base_reg_count - (class_sizes[i] - 1);
>> >> >     }
>> >> > +   screen->wm_reg_sets[index].class_first_reg[class_count] =
>> >> > ra_reg_count;
>> >> >
>> >> >     uint8_t *ra_reg_to_grf = ralloc_array(screen, uint8_t,
>> >> > ra_reg_count);
>> >> >     struct ra_regs *regs = ra_alloc_reg_set(screen, ra_reg_count);
>> >> > @@ -469,9 +471,29 @@ fs_visitor::assign_regs(bool allow_spilling)
>> >> >     }
>> >> >
>> >> >     setup_payload_interference(g, payload_node_count,
>> >> > first_payload_node);
>> >> > -   if (brw->gen >= 7)
>> >> > +   if (brw->gen >= 7) {
>> >> >        setup_mrf_hack_interference(g, first_mrf_hack_node);
>> >> >
>> >> > +      foreach_in_list(fs_inst, inst, &instructions) {
>> >> > +         /* When we do send-from-GRF for FB writes, we need to
>> >> > ensure
>> >> > that
>> >> > +          * the last write instruction sends from a high register.
>> >> > This is
>> >> > +          * because the vertex fetcher wants to start filling the
>> >> > low
>> >> > +          * payload registers while the pixel data port is still
>> >> > working on
>> >> > +          * writing out the memory.  If we don't do this, we get
>> >> > rendering
>> >> > +          * artifacts.
>> >> > +          *
>> >> > +          * We could just do "something high".  Instead, we just
>> >> > pick
>> >> > the
>> >> > +          * highest register that works.
>> >> > +          */
>> >> > +         if (inst->opcode == FS_OPCODE_FB_WRITE && inst->eot) {
>> >> > +            int size = virtual_grf_sizes[inst->src[0].reg];
>> >> > +            int reg = screen->wm_reg_sets[rsi].class_first_reg[size]
>> >> > -
>> >> > 1;
>> >> > +            ra_set_node_reg(g, inst->src[0].reg, reg);
>> >> > +            break;
>> >> > +         }
>> >> > +      }
>> >> > +   }
>> >> > +
>> >> >     if (dispatch_width > 8) {
>> >> >        /* In 16-wide dispatch we have an issue where a compressed
>> >> >         * instruction is actually two instructions executed
>> >> > simultaneiously.
>> >> > diff --git a/src/mesa/drivers/dri/i965/intel_screen.h
>> >> > b/src/mesa/drivers/dri/i965/intel_screen.h
>> >> > index 945f6f5..8bc6abd 100644
>> >> > --- a/src/mesa/drivers/dri/i965/intel_screen.h
>> >> > +++ b/src/mesa/drivers/dri/i965/intel_screen.h
>> >> > @@ -90,6 +90,11 @@ struct intel_screen
>> >> >        int classes[16];
>> >> >
>> >> >        /**
>> >> > +       * Array of the first ra reg in eahch ra class.
>> >>
>> >> Typo.
>> >
>> >
>> > Thanks
>> >
>> >>
>> >>
>> >> > +       */
>> >> > +      int class_first_reg[17];
>> >> > +
>> >> > +      /**
>> >> >         * Mapping for register-allocated objects in *regs to the
>> >> > first
>> >> >         * GRF for that object.
>> >> >         */
>> >> > --
>> >> > 2.1.0
>> >> >
>> >> > _______________________________________________
>> >> > mesa-dev mailing list
>> >> > mesa-dev at lists.freedesktop.org
>> >> > http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>> >>
>> >> I think it would be less confusing if instead of introducing a
>> >> "class_first_reg" array you added a class_last_reg array and then did
>> >> class_last_reg[size - 1] - iiuc, you're relying on the fact that the
>> >> first reg of one class minus one is the last reg of the previous
>> >> class, which isn't obvious and it takes a while to realize what's
>> >> going on.
>> >
>> >
>> > That's fair.  I can change that easily enough.  It doesn't really matter
>> > that much.  I just kind of liked the fact that  last_reg[n] -
>> > last_reg[n-1]
>> > == class_size.  But it doesn't matter and what you suggested would be
>> > far
>> > more obvious.
>> >
>> >>
>> >> BTW, just so you know, I don't think it's generally a good idea to set
>> >> more than one node to be pre-allocated to the same register, since
>> >> things will blow up if those nodes interfere. It's probably OK here
>> >> since I can't think of a way that an input register preallocated to
>> >> the highest register would interfere with the input to the fb write,
>> >
>> >
>> > Yes, I'm aware of the problem.  However, the only things we preallocate
>> > are
>> > payload registers which only ever go at the beginning and the final FB
>> > write
>> > which goes at the end.  As long as we don't start preallocating more
>> > things,
>> > I think we'll be fine.  Also, isn't there some cap on how many payload
>> > registers we can have?  If so then that already solves our problem.
>>
>> Yeah, that's what I meant. I'm not sure if the varying payload
>> registers can fill up the entire register space (and hence one gets
>> allocated to the same register as the fb write load_payload
>> destination), but I guess it doesn't matter too much.
>>
>> >
>> >>
>> >> but in the future we'll probably want to think about how to handle
>> >> this in a more sophisticated way. This is sort of on a tangent, but
>> >> with SSA register allocation we should probably allocate load_payload
>> >> destinations before anything else and then treat load_payload as a
>> >> series of writes to preallocated registers, since spilling the result
>> >> of a load_payload seems like a silly thing to do and then we don't
>> >> have to deal with allocating a contiguous set of registers which
>> >
>> >
>> > If only that were true...  The problem is that we will *always* have the
>> > problem of contiguous registers.  Once we add fp64 to the mix, we will
>> > have
>> > registers of size 1 and 2 in SIMD8 and registers of size 1 (Unsigned
>> > single
>> > words), 2 (dwords, floats, etc) and 4 (doubles) in SIMD16.  Sure, we can
>> > probably preallocate payload registers, but we'll never get away from
>> > handling multiple sizes.
>>
>> True, there's still that problem. And don't forget that we want to be
>> able to allocate scalar/uniform registers, so now you have registers
>> of size 1, 8, and 16...
>
> Yeah. I think right now we always do those with pulls so its not a problem
> yet.  We may want to do that in the future though.

Well, I meant more realizing when computations are done uniformly
across all the EU's and then storing the result in only 1 register
instead of 8 or 16 to save register pressure. I'm not sure if that's
what you were thinking of.

>
>> I think Periera's puzzle-solving allocator
>> (http://compilers.cs.ucla.edu/fernando/publications/papers/PhdDiss.pdf)
>> handles the case where the register classes are aligned, so if we make
>> SIMD16 registers aligned like we currently do it might work. It's
>> definitely a difficult problem though, and it'll take a lot of work to
>> solve.
>
> Thanks for the link. I'll give it a look.
> --Jason
>>
>> >
>> >>
>> >> breaks some nice properties of SSA register allocation. Then, we could
>> >> have a simple heuristic to allocate the source of an fb write as high
>> >> as possible, or just allocate all load_payload instructions as high as
>> >> possible.
>> >
>> >
>> > I've actually been thinking about the register allocation problem quite
>> > a
>> > bit lately.  One of the things I've got on my todo list is to do some
>> > research and figure out if we can be doing it better.  There may be a
>> > better
>> > way to handle the multiple sizes problem than we're doing currently.
>> > Unfortunately, I can't really say much now because I haven't dug into it
>> > very far yet.