[Mesa-dev] [PATCH 41/41] SQUASH: i965/fs: Force a high register for the final FB write

Sat Sep 20 14:52:32 PDT 2014

On Sat, Sep 20, 2014 at 4:56 PM, Jason Ekstrand <jason at jlekstrand.net> wrote:
>
>
> On Sat, Sep 20, 2014 at 12:44 PM, Connor Abbott <cwabbott0 at gmail.com> wrote:
>>
>> On Sat, Sep 20, 2014 at 1:23 PM, Jason Ekstrand <jason at jlekstrand.net>
>> wrote:
>> > ---
>> >  src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp | 24
>> > ++++++++++++++++++++++-
>> >  src/mesa/drivers/dri/i965/intel_screen.h          |  5 +++++
>> >  2 files changed, 28 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>> > b/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>> > index b42f1e5..41e8855 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>> > +++ b/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>> > @@ -116,8 +116,10 @@ brw_alloc_reg_set(struct intel_screen *screen, int
>> > reg_width)
>> >     /* Compute the total number of registers across all classes. */
>> >     int ra_reg_count = 0;
>> >     for (int i = 0; i < class_count; i++) {
>> > +      screen->wm_reg_sets[index].class_first_reg[i] = ra_reg_count;
>> >        ra_reg_count += base_reg_count - (class_sizes[i] - 1);
>> >     }
>> > +   screen->wm_reg_sets[index].class_first_reg[class_count] =
>> > ra_reg_count;
>> >
>> >     uint8_t *ra_reg_to_grf = ralloc_array(screen, uint8_t,
>> > ra_reg_count);
>> >     struct ra_regs *regs = ra_alloc_reg_set(screen, ra_reg_count);
>> > @@ -469,9 +471,29 @@ fs_visitor::assign_regs(bool allow_spilling)
>> >     }
>> >
>> >     setup_payload_interference(g, payload_node_count,
>> > first_payload_node);
>> > -   if (brw->gen >= 7)
>> > +   if (brw->gen >= 7) {
>> >        setup_mrf_hack_interference(g, first_mrf_hack_node);
>> >
>> > +      foreach_in_list(fs_inst, inst, &instructions) {
>> > +         /* When we do send-from-GRF for FB writes, we need to ensure
>> > that
>> > +          * the last write instruction sends from a high register.
>> > This is
>> > +          * because the vertex fetcher wants to start filling the low
>> > +          * payload registers while the pixel data port is still
>> > working on
>> > +          * writing out the memory.  If we don't do this, we get
>> > rendering
>> > +          * artifacts.
>> > +          *
>> > +          * We could just do "something high".  Instead, we just pick
>> > the
>> > +          * highest register that works.
>> > +          */
>> > +         if (inst->opcode == FS_OPCODE_FB_WRITE && inst->eot) {
>> > +            int size = virtual_grf_sizes[inst->src[0].reg];
>> > +            int reg = screen->wm_reg_sets[rsi].class_first_reg[size] -
>> > 1;
>> > +            ra_set_node_reg(g, inst->src[0].reg, reg);
>> > +            break;
>> > +         }
>> > +      }
>> > +   }
>> > +
>> >     if (dispatch_width > 8) {
>> >        /* In 16-wide dispatch we have an issue where a compressed
>> >         * instruction is actually two instructions executed
>> > simultaneiously.
>> > diff --git a/src/mesa/drivers/dri/i965/intel_screen.h
>> > b/src/mesa/drivers/dri/i965/intel_screen.h
>> > index 945f6f5..8bc6abd 100644
>> > --- a/src/mesa/drivers/dri/i965/intel_screen.h
>> > +++ b/src/mesa/drivers/dri/i965/intel_screen.h
>> > @@ -90,6 +90,11 @@ struct intel_screen
>> >        int classes[16];
>> >
>> >        /**
>> > +       * Array of the first ra reg in eahch ra class.
>>
>> Typo.
>
>
> Thanks
>
>>
>>
>> > +       */
>> > +      int class_first_reg[17];
>> > +
>> > +      /**
>> >         * Mapping for register-allocated objects in *regs to the first
>> >         * GRF for that object.
>> >         */
>> > --
>> > 2.1.0
>> >
>> > _______________________________________________
>> > mesa-dev mailing list
>> > mesa-dev at lists.freedesktop.org
>> > http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
>> I think it would be less confusing if instead of introducing a
>> "class_first_reg" array you added a class_last_reg array and then did
>> class_last_reg[size - 1] - iiuc, you're relying on the fact that the
>> first reg of one class minus one is the last reg of the previous
>> class, which isn't obvious and it takes a while to realize what's
>> going on.
>
>
> That's fair.  I can change that easily enough.  It doesn't really matter
> that much.  I just kind of liked the fact that  last_reg[n] - last_reg[n-1]
> == class_size.  But it doesn't matter and what you suggested would be far
> more obvious.
>
>>
>> BTW, just so you know, I don't think it's generally a good idea to set
>> more than one node to be pre-allocated to the same register, since
>> things will blow up if those nodes interfere. It's probably OK here
>> since I can't think of a way that an input register preallocated to
>> the highest register would interfere with the input to the fb write,
>
>
> Yes, I'm aware of the problem.  However, the only things we preallocate are
> payload registers which only ever go at the beginning and the final FB write
> which goes at the end.  As long as we don't start preallocating more things,
> I think we'll be fine.  Also, isn't there some cap on how many payload
> registers we can have?  If so then that already solves our problem.

Yeah, that's what I meant. I'm not sure if the varying payload
registers can fill up the entire register space (and hence one gets
allocated to the same register as the fb write load_payload
destination), but I guess it doesn't matter too much.

>
>>
>> but in the future we'll probably want to think about how to handle
>> this in a more sophisticated way. This is sort of on a tangent, but
>> with SSA register allocation we should probably allocate load_payload
>> destinations before anything else and then treat load_payload as a
>> series of writes to preallocated registers, since spilling the result
>> of a load_payload seems like a silly thing to do and then we don't
>> have to deal with allocating a contiguous set of registers which
>
>
> If only that were true...  The problem is that we will *always* have the
> problem of contiguous registers.  Once we add fp64 to the mix, we will have
> registers of size 1 and 2 in SIMD8 and registers of size 1 (Unsigned single
> words), 2 (dwords, floats, etc) and 4 (doubles) in SIMD16.  Sure, we can
> probably preallocate payload registers, but we'll never get away from
> handling multiple sizes.

True, there's still that problem. And don't forget that we want to be
able to allocate scalar/uniform registers, so now you have registers
of size 1, 8, and 16... I think Periera's puzzle-solving allocator
(http://compilers.cs.ucla.edu/fernando/publications/papers/PhdDiss.pdf)
handles the case where the register classes are aligned, so if we make
SIMD16 registers aligned like we currently do it might work. It's
definitely a difficult problem though, and it'll take a lot of work to
solve.

>
>>
>> breaks some nice properties of SSA register allocation. Then, we could
>> have a simple heuristic to allocate the source of an fb write as high
>> as possible, or just allocate all load_payload instructions as high as
>> possible.
>
>
> I've actually been thinking about the register allocation problem quite a
> bit lately.  One of the things I've got on my todo list is to do some
> research and figure out if we can be doing it better.  There may be a better
> way to handle the multiple sizes problem than we're doing currently.
> Unfortunately, I can't really say much now because I haven't dug into it
> very far yet.