[Mesa-dev] [PATCH 41/41] SQUASH: i965/fs: Force a high register for the final FB write

Sat Sep 20 12:44:19 PDT 2014

On Sat, Sep 20, 2014 at 1:23 PM, Jason Ekstrand <jason at jlekstrand.net> wrote:
> ---
>  src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp | 24 ++++++++++++++++++++++-
>  src/mesa/drivers/dri/i965/intel_screen.h          |  5 +++++
>  2 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp b/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
> index b42f1e5..41e8855 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
> @@ -116,8 +116,10 @@ brw_alloc_reg_set(struct intel_screen *screen, int reg_width)
>     /* Compute the total number of registers across all classes. */
>     int ra_reg_count = 0;
>     for (int i = 0; i < class_count; i++) {
> +      screen->wm_reg_sets[index].class_first_reg[i] = ra_reg_count;
>        ra_reg_count += base_reg_count - (class_sizes[i] - 1);
>     }
> +   screen->wm_reg_sets[index].class_first_reg[class_count] = ra_reg_count;
>
>     uint8_t *ra_reg_to_grf = ralloc_array(screen, uint8_t, ra_reg_count);
>     struct ra_regs *regs = ra_alloc_reg_set(screen, ra_reg_count);
> @@ -469,9 +471,29 @@ fs_visitor::assign_regs(bool allow_spilling)
>     }
>
>     setup_payload_interference(g, payload_node_count, first_payload_node);
> -   if (brw->gen >= 7)
> +   if (brw->gen >= 7) {
>        setup_mrf_hack_interference(g, first_mrf_hack_node);
>
> +      foreach_in_list(fs_inst, inst, &instructions) {
> +         /* When we do send-from-GRF for FB writes, we need to ensure that
> +          * the last write instruction sends from a high register.  This is
> +          * because the vertex fetcher wants to start filling the low
> +          * payload registers while the pixel data port is still working on
> +          * writing out the memory.  If we don't do this, we get rendering
> +          * artifacts.
> +          *
> +          * We could just do "something high".  Instead, we just pick the
> +          * highest register that works.
> +          */
> +         if (inst->opcode == FS_OPCODE_FB_WRITE && inst->eot) {
> +            int size = virtual_grf_sizes[inst->src[0].reg];
> +            int reg = screen->wm_reg_sets[rsi].class_first_reg[size] - 1;
> +            ra_set_node_reg(g, inst->src[0].reg, reg);
> +            break;
> +         }
> +      }
> +   }
> +
>     if (dispatch_width > 8) {
>        /* In 16-wide dispatch we have an issue where a compressed
>         * instruction is actually two instructions executed simultaneiously.
> diff --git a/src/mesa/drivers/dri/i965/intel_screen.h b/src/mesa/drivers/dri/i965/intel_screen.h
> index 945f6f5..8bc6abd 100644
> --- a/src/mesa/drivers/dri/i965/intel_screen.h
> +++ b/src/mesa/drivers/dri/i965/intel_screen.h
> @@ -90,6 +90,11 @@ struct intel_screen
>        int classes[16];
>
>        /**
> +       * Array of the first ra reg in eahch ra class.

Typo.

> +       */
> +      int class_first_reg[17];
> +
> +      /**
>         * Mapping for register-allocated objects in *regs to the first
>         * GRF for that object.
>         */
> --
> 2.1.0
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev

I think it would be less confusing if instead of introducing a
"class_first_reg" array you added a class_last_reg array and then did
class_last_reg[size - 1] - iiuc, you're relying on the fact that the
first reg of one class minus one is the last reg of the previous
class, which isn't obvious and it takes a while to realize what's
going on.

BTW, just so you know, I don't think it's generally a good idea to set
more than one node to be pre-allocated to the same register, since
things will blow up if those nodes interfere. It's probably OK here
since I can't think of a way that an input register preallocated to
the highest register would interfere with the input to the fb write,
but in the future we'll probably want to think about how to handle
this in a more sophisticated way. This is sort of on a tangent, but
with SSA register allocation we should probably allocate load_payload
destinations before anything else and then treat load_payload as a
series of writes to preallocated registers, since spilling the result
of a load_payload seems like a silly thing to do and then we don't
have to deal with allocating a contiguous set of registers which
breaks some nice properties of SSA register allocation. Then, we could
have a simple heuristic to allocate the source of an fb write as high
as possible, or just allocate all load_payload instructions as high as
possible.