[Mesa-stable] [PATCH 2/2] i965: Support "unlimited" compute shader scratch space.

Sun Jun 19 01:21:32 UTC 2016

Kenneth Graunke <kenneth at whitecape.org> writes:

> Ivybridge and Baytrail have a pretty harsh limit of 12kB scratch space
> per thread.  However, we can exceed this limit with some clever tricks
> and gain effectively unlimited scratch space.
>
> Later platforms have a 2MB limit, which is much more reasonable, but
> we may as well apply the same trick there.
>
> We can probably extend this trick to other stages, but would need to
> adjust the shader code for the different thread payload layouts.
>

Hmm, I believe the scratch space pointer has the exact same format in
the payload for all stages (otherwise the dataport wouldn't be able to
decode the stateless message header regardless of the stage), but the
exact interpretation of bits 0-9 of the same dword changes slightly
across stages, and I'm not sure that the mapping between scratch offset
and FFTID is as straightforward for all fixed functions as it is for
compute shaders, so it seems reasonable to me to enable this on the
compute stage only for the moment.

> Cc: "12.0" <mesa-stable at lists.freedesktop.org>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96505
> Signed-off-by: Kenneth Graunke <kenneth at whitecape.org>
> ---
>  src/mesa/drivers/dri/i965/brw_fs.cpp      | 50 +++++++++++++++++++++++--------
>  src/mesa/drivers/dri/i965/gen7_cs_state.c |  6 ++--
>  2 files changed, 41 insertions(+), 15 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
> index cadf905..9075918 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> @@ -5993,19 +5993,45 @@ fs_visitor::allocate_registers(bool allow_spilling)
>              prog_data->total_scratch = ALIGN(last_scratch, 1024);
>              max_scratch_size = 12 * 1024;
>           }
> -      }
>  
> -      /* We currently only support up to 2MB of scratch space.  If we
> -       * need to support more eventually, the documentation suggests
> -       * that we could allocate a larger buffer, and partition it out
> -       * ourselves.  We'd just have to undo the hardware's address
> -       * calculation by subtracting (FFTID * Per Thread Scratch Space)
> -       * and then add FFTID * (Larger Per Thread Scratch Space).
> -       *
> -       * See 3D-Media-GPGPU Engine > Media GPGPU Pipeline >
> -       * Thread Group Tracking > Local Memory/Scratch Space.
> -       */
> -      assert(prog_data->total_scratch < max_scratch_size);
> +         if (prog_data->total_scratch >= max_scratch_size) {
> +            /* Normally, the hardware computes a pointer to the scratch
> +             * space region for our thread for us.  This comes with a
> +             * limitation - each thread can only use a limited amount
> +             * of scratch space.  To support more than that limit, we
> +             * can subtract the hardware's offset and add our own:
> +             *
> +             * Scratch Space Pointer +=
> +             *    FFTID * (prog_data->total_scratch - max_scratch_size);
> +             *
> +             * See 3D-Media-GPGPU Engine > Media GPGPU Pipeline >
> +             * Thread Group Tracking > Local Memory/Scratch Space.
> +             */
> +            const fs_builder bld =
> +               fs_builder(this, cfg->blocks[0], (fs_inst *)
> +                          cfg->blocks[0]->start()).exec_all();

The fs_builder() constructor that takes an fs_inst initializes the
default annotation and execution controls based on the instruction
passed as argument, which I'm not sure is what you want -- I suggest you
do something like the following instead:

|            const fs_builder bld =
|               fs_builder(this, 1).at(cfg->blocks[0], cfg->blocks[0]->start())
|                                  .exec_all(),

Note that the cast (which could lead to invalid memory access if the
first basic block in the program happened to be empty -- not sure if
that's actually possible for compute shaders right now) shouldn't be
necessary if you use fs_builder::at(), which doesn't expect more than an
exec_node object as argument to allow insertion in an empty basic block.

The above also specifies a default execution size of one explicitly
instead of taking it from the first instruction in the program, because
the code you want to generate below is scalar and shouldn't care what
the first instruction of the program is.  I believe your code would have
been translated to scalar eventually at codegen time because of the
1-wide fixed GRF destination registers, but it would have caused a
temporary inconsistency in the IR between the exec_size field of the
instructions and the actual behavior of the program, which could lead to
breakage if we ever do any kind of IR processing after this [I was about
to suggest emitting this hack between register allocation and post-RA
scheduling, then recalled that the scheduler isn't terribly useful right
now with instructions that use fixed GRFs, though it might be worth
trying anyway].

> +            struct brw_reg g0_5 =
> +               retype(brw_vec1_grf(0, 5), BRW_REGISTER_TYPE_D);
> +            struct brw_reg g127 =
> +               retype(brw_vec1_grf(127, 0), BRW_REGISTER_TYPE_D);
> +
> +            /* Multiply FFTID by max_scratch_size and shift the result
> +             * left by 10.  Then subtract that from the scratch pointer.
> +             *

I guess your code below doesn't actually need to shift unlike the
comment says because the field contents are expressed in units of 2^10 B,
so you just rely on the total_scratch and max_scratch_size values to
be KB-aligned.

> +             * We need a register for temporary storage, but can't allocate
> +             * one post-register-allocation.  However, since our code will
> +             * be emitted at the top of the program, we can safely use g127,
> +             * as it isn't part of the thread payload and can't be in use.
> +             */
> +            bld.AND(g127, g0_5, brw_imm_d(0x3ff));
> +            bld.MUL(g127, g127,
> +                    brw_imm_d(prog_data->total_scratch - max_scratch_size));

I think on pre-Gen8 this is going to overflow the usable range of the
second source of the MUL instruction rather easily (for total_scratch
greater than roughly 64 KB).  I guess you could either do the
multiplication in KB units (i.e. divide the immediate by 1024) and then
SHL the result, which gives you an effective maximum of 64 MB per thread
[Should be enough for anybody, right? ;)], or use two shift instructions
to calculate the product of the FFTID with total_scratch and
max_scratch_size and then accumulate the result (which would require
limiting max_scratch_size to 8KB on IVB and changing the logic in the
last patch slightly since both total_scratch and max_scratch_size would
have to be powers of two for the left shift to work).

> +            bld.ADD(g0_5, g0_5, g127);
> +         }
> +      } else {
> +         /* We should extend the above hack for other stages someday. */
> +         assert(prog_data->total_scratch < max_scratch_size);
> +      }
>     }
>  }
>  
> diff --git a/src/mesa/drivers/dri/i965/gen7_cs_state.c b/src/mesa/drivers/dri/i965/gen7_cs_state.c
> index 5fb8829..b89a186 100644
> --- a/src/mesa/drivers/dri/i965/gen7_cs_state.c
> +++ b/src/mesa/drivers/dri/i965/gen7_cs_state.c
> @@ -70,21 +70,21 @@ brw_upload_cs_state(struct brw_context *brw)
>            */
>           OUT_RELOC64(stage_state->scratch_bo,
>                       I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER,
> -                     ffs(stage_state->per_thread_scratch) - 11);
> +                     MIN2(ffs(stage_state->per_thread_scratch) - 11, 11));
>        } else if (brw->is_haswell) {
>           /* Haswell's Per Thread Scratch Space is in the range [0, 10]
>            * where 0 = 2k, 1 = 4k, 2 = 8k, ..., 10 = 2M.
>            */
>           OUT_RELOC(stage_state->scratch_bo,
>                     I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER,
> -                   ffs(stage_state->per_thread_scratch) - 12);
> +                   MIN2(ffs(stage_state->per_thread_scratch) - 12, 10));
>        } else {
>           /* Earlier platforms use the range [0, 11] to mean [1kB, 12kB]
>            * where 0 = 1kB, 1 = 2kB, 2 = 3kB, ..., 11 = 12kB.
>            */
>           OUT_RELOC(stage_state->scratch_bo,
>                     I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER,
> -                   stage_state->per_thread_scratch / 1024 - 1);
> +                   MIN2(stage_state->per_thread_scratch / 1024, 12) - 1);
>        }
>     } else {
>        OUT_BATCH(0);
> -- 
> 2.8.3
>
> _______________________________________________
> mesa-stable mailing list
> mesa-stable at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-stable
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-stable/attachments/20160618/18993370/attachment.sig>