[Mesa-dev] [Mesa-stable] [PATCH 2/2] i965: Support "unlimited" compute shader scratch space.
Francisco Jerez
currojerez at riseup.net
Sun Jun 19 01:51:06 UTC 2016
Francisco Jerez <currojerez at riseup.net> writes:
> Kenneth Graunke <kenneth at whitecape.org> writes:
>
>> Ivybridge and Baytrail have a pretty harsh limit of 12kB scratch space
>> per thread. However, we can exceed this limit with some clever tricks
>> and gain effectively unlimited scratch space.
>>
>> Later platforms have a 2MB limit, which is much more reasonable, but
>> we may as well apply the same trick there.
>>
>> We can probably extend this trick to other stages, but would need to
>> adjust the shader code for the different thread payload layouts.
>>
>
> Hmm, I believe the scratch space pointer has the exact same format in
> the payload for all stages (otherwise the dataport wouldn't be able to
> decode the stateless message header regardless of the stage).  However,
> the exact interpretation of bits 0-9 of the same dword changes slightly
> across stages, and I'm not sure that the mapping between scratch offset
> and FFTID is as straightforward for all fixed functions as it is for
> compute shaders, so it seems reasonable to me to enable this on the
> compute stage only for the moment.
>
>> Cc: "12.0" <mesa-stable at lists.freedesktop.org>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96505
>> Signed-off-by: Kenneth Graunke <kenneth at whitecape.org>
>> ---
>> src/mesa/drivers/dri/i965/brw_fs.cpp | 50 +++++++++++++++++++++++--------
>> src/mesa/drivers/dri/i965/gen7_cs_state.c | 6 ++--
>> 2 files changed, 41 insertions(+), 15 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> index cadf905..9075918 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> @@ -5993,19 +5993,45 @@ fs_visitor::allocate_registers(bool allow_spilling)
>> prog_data->total_scratch = ALIGN(last_scratch, 1024);
>> max_scratch_size = 12 * 1024;
>> }
>> - }
>>
>> - /* We currently only support up to 2MB of scratch space. If we
>> - * need to support more eventually, the documentation suggests
>> - * that we could allocate a larger buffer, and partition it out
>> - * ourselves. We'd just have to undo the hardware's address
>> - * calculation by subtracting (FFTID * Per Thread Scratch Space)
>> - * and then add FFTID * (Larger Per Thread Scratch Space).
>> - *
>> - * See 3D-Media-GPGPU Engine > Media GPGPU Pipeline >
>> - * Thread Group Tracking > Local Memory/Scratch Space.
>> - */
>> - assert(prog_data->total_scratch < max_scratch_size);
>> + if (prog_data->total_scratch >= max_scratch_size) {
>> + /* Normally, the hardware computes a pointer to the scratch
>> + * space region for our thread for us. This comes with a
>> + * limitation - each thread can only use a limited amount
>> + * of scratch space. To support more than that limit, we
>> + * can subtract the hardware's offset and add our own:
>> + *
>> + * Scratch Space Pointer +=
>> + * FFTID * (prog_data->total_scratch - max_scratch_size);
>> + *
>> + * See 3D-Media-GPGPU Engine > Media GPGPU Pipeline >
>> + * Thread Group Tracking > Local Memory/Scratch Space.
>> + */
>> + const fs_builder bld =
>> + fs_builder(this, cfg->blocks[0], (fs_inst *)
>> + cfg->blocks[0]->start()).exec_all();
>
> The fs_builder() constructor that takes an fs_inst initializes the
> default annotation and execution controls based on the instruction
> passed as argument, which I'm not sure is what you want -- I suggest you
> do something like the following instead:
>
> | const fs_builder bld =
> | fs_builder(this, 1).at(cfg->blocks[0], cfg->blocks[0]->start())
> |              .exec_all();
>
> Note that the cast (which could lead to an invalid memory access if
> the first basic block in the program happened to be empty -- not sure
> whether that's actually possible for compute shaders right now)
> shouldn't be necessary if you use fs_builder::at(), which expects
> nothing more than an exec_node object as argument, precisely to allow
> insertion into an empty basic block.
>
> The above also specifies a default execution size of one explicitly
> instead of taking it from the first instruction in the program, because
> the code you want to generate below is scalar and shouldn't care what
> the first instruction of the program is.  I believe your code would
> have been translated to scalar eventually at codegen time because of
> the 1-wide fixed GRF destination registers, but it would have caused a
> temporary inconsistency in the IR between the exec_size field of the
> instructions and the actual behavior of the program, which could lead
> to breakage if we ever do any kind of IR processing after this.  [I was
> about to suggest emitting this hack between register allocation and
> post-RA scheduling, then recalled that the scheduler isn't terribly
> useful right now with instructions that use fixed GRFs, though it might
> be worth trying anyway.]
>
>> + struct brw_reg g0_5 =
>> + retype(brw_vec1_grf(0, 5), BRW_REGISTER_TYPE_D);
>> + struct brw_reg g127 =
>> + retype(brw_vec1_grf(127, 0), BRW_REGISTER_TYPE_D);
>> +
>> + /* Multiply FFTID by max_scratch_size and shift the result
>> + * left by 10. Then subtract that from the scratch pointer.
>> + *
>
> I guess your code below doesn't actually need the shift the comment
> mentions, because the field contents are expressed in units of 2^10 B,
> so you just rely on the total_scratch and max_scratch_size values
> being kB-aligned.
>
>> + * We need a register for temporary storage, but can't allocate
>> + * one post-register-allocation. However, since our code will
>> + * be emitted at the top of the program, we can safely use g127,
>> + * as it isn't part of the thread payload and can't be in use.
>> + */
>> + bld.AND(g127, g0_5, brw_imm_d(0x3ff));
>> + bld.MUL(g127, g127,
>> + brw_imm_d(prog_data->total_scratch - max_scratch_size));
>
> I think on pre-Gen8 this is going to overflow the usable range of the
> second source of the MUL instruction rather easily (for total_scratch
> greater than roughly 64 KB).  I guess you could either do the
> multiplication in KB units (i.e. divide the immediate by 1024) and then
> SHL the result, which gives you an effective maximum of 64 MB per thread
> [Should be enough for anybody, right? ;)], or use two shift instructions
> to calculate the product of the FFTID with total_scratch and
> max_scratch_size and then accumulate the result.  The latter would
> require limiting max_scratch_size to 8 KB on IVB and changing the logic
> in the last patch slightly, since both total_scratch and
> max_scratch_size would have to be powers of two for the left shift to
> work.
>
Another idea would be to do the multiplication directly in
max_scratch_size units (though that would require total_scratch to be
aligned to a max_scratch_size multiple, which is trivial except on Gen7,
where max_scratch_size may not be a power of two).  That would give you
a maximum per-thread scratch size of 768 MB on IVB CS, and 128 GB
elsewhere.
>> + bld.ADD(g0_5, g0_5, g127);
>> + }
>> + } else {
>> + /* We should extend the above hack for other stages someday. */
>> + assert(prog_data->total_scratch < max_scratch_size);
>> + }
>> }
>> }
>>
>> diff --git a/src/mesa/drivers/dri/i965/gen7_cs_state.c b/src/mesa/drivers/dri/i965/gen7_cs_state.c
>> index 5fb8829..b89a186 100644
>> --- a/src/mesa/drivers/dri/i965/gen7_cs_state.c
>> +++ b/src/mesa/drivers/dri/i965/gen7_cs_state.c
>> @@ -70,21 +70,21 @@ brw_upload_cs_state(struct brw_context *brw)
>> */
>> OUT_RELOC64(stage_state->scratch_bo,
>> I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER,
>> - ffs(stage_state->per_thread_scratch) - 11);
>> + MIN2(ffs(stage_state->per_thread_scratch) - 11, 11));
>> } else if (brw->is_haswell) {
>> /* Haswell's Per Thread Scratch Space is in the range [0, 10]
>> * where 0 = 2k, 1 = 4k, 2 = 8k, ..., 10 = 2M.
>> */
>> OUT_RELOC(stage_state->scratch_bo,
>> I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER,
>> - ffs(stage_state->per_thread_scratch) - 12);
>> + MIN2(ffs(stage_state->per_thread_scratch) - 12, 10));
>> } else {
>> /* Earlier platforms use the range [0, 11] to mean [1kB, 12kB]
>> * where 0 = 1kB, 1 = 2kB, 2 = 3kB, ..., 11 = 12kB.
>> */
>> OUT_RELOC(stage_state->scratch_bo,
>> I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER,
>> - stage_state->per_thread_scratch / 1024 - 1);
>> + MIN2(stage_state->per_thread_scratch / 1024, 12) - 1);
>> }
>> } else {
>> OUT_BATCH(0);
>> --
>> 2.8.3
>>