[Mesa-stable] [PATCH 2/2] i965: Support "unlimited" compute shader scratch space.
Francisco Jerez
currojerez at riseup.net
Sun Jun 19 01:51:06 UTC 2016
Francisco Jerez <currojerez at riseup.net> writes:
> Kenneth Graunke <kenneth at whitecape.org> writes:
>
>> Ivybridge and Baytrail have a pretty harsh limit of 12kB scratch space
>> per thread. However, we can exceed this limit with some clever tricks
>> and gain effectively unlimited scratch space.
>>
>> Later platforms have a 2MB limit, which is much more reasonable, but
>> we may as well apply the same trick there.
>>
>> We can probably extend this trick to other stages, but would need to
>> adjust the shader code for the different thread payload layouts.
>>
>
> Hmm, I believe the scratch space pointer has the exact same format in
> the payload for all stages (otherwise the dataport wouldn't be able to
> decode the stateless message header regardless of the stage), but the
> exact interpretation of bits 0-9 of the same dword changes slightly
> across stages, and I'm not sure that the mapping between scratch offset
> and FFTID is as straightforward for all fixed functions as it is for
> compute shaders, so it seems reasonable to me to enable this on the
> compute stage only for the moment.
>
>> Cc: "12.0" <mesa-stable at lists.freedesktop.org>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96505
>> Signed-off-by: Kenneth Graunke <kenneth at whitecape.org>
>> ---
>> src/mesa/drivers/dri/i965/brw_fs.cpp | 50 +++++++++++++++++++++++--------
>> src/mesa/drivers/dri/i965/gen7_cs_state.c | 6 ++--
>> 2 files changed, 41 insertions(+), 15 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> index cadf905..9075918 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> @@ -5993,19 +5993,45 @@ fs_visitor::allocate_registers(bool allow_spilling)
>> prog_data->total_scratch = ALIGN(last_scratch, 1024);
>> max_scratch_size = 12 * 1024;
>> }
>> - }
>>
>> - /* We currently only support up to 2MB of scratch space. If we
>> - * need to support more eventually, the documentation suggests
>> - * that we could allocate a larger buffer, and partition it out
>> - * ourselves. We'd just have to undo the hardware's address
>> - * calculation by subtracting (FFTID * Per Thread Scratch Space)
>> - * and then add FFTID * (Larger Per Thread Scratch Space).
>> - *
>> - * See 3D-Media-GPGPU Engine > Media GPGPU Pipeline >
>> - * Thread Group Tracking > Local Memory/Scratch Space.
>> - */
>> - assert(prog_data->total_scratch < max_scratch_size);
>> + if (prog_data->total_scratch >= max_scratch_size) {
>> + /* Normally, the hardware computes a pointer to the scratch
>> + * space region for our thread for us. This comes with a
>> + * limitation - each thread can only use a limited amount
>> + * of scratch space. To support more than that limit, we
>> + * can subtract the hardware's offset and add our own:
>> + *
>> + * Scratch Space Pointer +=
>> + * FFTID * (prog_data->total_scratch - max_scratch_size);
>> + *
>> + * See 3D-Media-GPGPU Engine > Media GPGPU Pipeline >
>> + * Thread Group Tracking > Local Memory/Scratch Space.
>> + */
>> + const fs_builder bld =
>> + fs_builder(this, cfg->blocks[0], (fs_inst *)
>> + cfg->blocks[0]->start()).exec_all();
>
> The fs_builder() constructor that takes an fs_inst initializes the
> default annotation and execution controls based on the instruction
> passed as argument, which I'm not sure is what you want -- I suggest you
> do something like the following instead:
>
> | const fs_builder bld =
> | fs_builder(this, 1).at(cfg->blocks[0], cfg->blocks[0]->start())
> | .exec_all(),
>
> Note that the cast (which could lead to invalid memory access if the
> first basic block in the program happened to be empty -- not sure if
> that's actually possible for compute shaders right now) shouldn't be
> necessary if you use fs_builder::at(), which doesn't expect more than an
> exec_node object as argument to allow insertion in an empty basic block.
>
> The above also specifies a default execution size of one explicitly
> instead of taking it from the first instruction in the program, because
> the code you want to generate below is scalar and shouldn't care what
> the first instruction of the program is. I believe your code would have
> been translated to scalar eventually at codegen time because of the
> 1-wide fixed GRF destination registers, but it would have caused a
> temporary inconsistency in the IR between the exec_size field of the
> instructions and the actual behavior of the program, which could lead to
> breakage if we ever do any kind of IR processing after this [I was about
> to suggest emitting this hack between register allocation and post-RA
> scheduling, then recalled that the scheduler isn't terribly useful right
> now with instructions that use fixed GRFs, though it might be worth
> trying anyway].
>
>> + struct brw_reg g0_5 =
>> + retype(brw_vec1_grf(0, 5), BRW_REGISTER_TYPE_D);
>> + struct brw_reg g127 =
>> + retype(brw_vec1_grf(127, 0), BRW_REGISTER_TYPE_D);
>> +
>> + /* Multiply FFTID by max_scratch_size and shift the result
>> + * left by 10. Then subtract that from the scratch pointer.
>> + *
>
> I guess your code below doesn't actually need to shift unlike the
> comment says because the field contents are expressed in units of 2^10 B,
> so you just rely on the total_scratch and max_scratch_size values to
> be KB-aligned.
>
>> + * We need a register for temporary storage, but can't allocate
>> + * one post-register-allocation. However, since our code will
>> + * be emitted at the top of the program, we can safely use g127,
>> + * as it isn't part of the thread payload and can't be in use.
>> + */
>> + bld.AND(g127, g0_5, brw_imm_d(0x3ff));
>> + bld.MUL(g127, g127,
>> + brw_imm_d(prog_data->total_scratch - max_scratch_size));
>
> I think on pre-Gen8 this is going to overflow the usable range of the
> second source of the MUL instruction rather easily (for total_scratch
> greater than roughly 64 KB). I guess you could either do the
> multiplication in KB units (i.e. divide the immediate by 1024) and then
> SHL the result, which gives you an effective maximum of 64 MB per thread
> [Should be enough for anybody, right? ;)], or use two shift instructions
> to calculate the product of the FFTID with total_scratch and
> max_scratch_size and then accumulate the result (which would require
> limiting max_scratch_size to 8KB on IVB and changing the logic in the
> last patch slightly since both total_scratch and max_scratch_size would
> have to be powers of two for the left shift to work).
>
Another idea would be to do the multiplication directly in
max_scratch_size units (though that would require total_scratch to be
aligned to a max_scratch_size multiple which is trivial except on Gen7
where max_scratch_size may not be a power of two). That would give you
a maximum per thread scratch size of 768 MB on IVB CS, and 128 GB
elsewhere.
>> + bld.ADD(g0_5, g0_5, g127);
>> + }
>> + } else {
>> + /* We should extend the above hack for other stages someday. */
>> + assert(prog_data->total_scratch < max_scratch_size);
>> + }
>> }
>> }
>>
>> diff --git a/src/mesa/drivers/dri/i965/gen7_cs_state.c b/src/mesa/drivers/dri/i965/gen7_cs_state.c
>> index 5fb8829..b89a186 100644
>> --- a/src/mesa/drivers/dri/i965/gen7_cs_state.c
>> +++ b/src/mesa/drivers/dri/i965/gen7_cs_state.c
>> @@ -70,21 +70,21 @@ brw_upload_cs_state(struct brw_context *brw)
>> */
>> OUT_RELOC64(stage_state->scratch_bo,
>> I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER,
>> - ffs(stage_state->per_thread_scratch) - 11);
>> + MIN2(ffs(stage_state->per_thread_scratch) - 11, 11));
>> } else if (brw->is_haswell) {
>> /* Haswell's Per Thread Scratch Space is in the range [0, 10]
>> * where 0 = 2k, 1 = 4k, 2 = 8k, ..., 10 = 2M.
>> */
>> OUT_RELOC(stage_state->scratch_bo,
>> I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER,
>> - ffs(stage_state->per_thread_scratch) - 12);
>> + MIN2(ffs(stage_state->per_thread_scratch) - 12, 10));
>> } else {
>> /* Earlier platforms use the range [0, 11] to mean [1kB, 12kB]
>> * where 0 = 1kB, 1 = 2kB, 2 = 3kB, ..., 11 = 12kB.
>> */
>> OUT_RELOC(stage_state->scratch_bo,
>> I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER,
>> - stage_state->per_thread_scratch / 1024 - 1);
>> + MIN2(stage_state->per_thread_scratch / 1024, 12) - 1);
>> }
>> } else {
>> OUT_BATCH(0);
>> --
>> 2.8.3
>>
>> _______________________________________________
>> mesa-stable mailing list
>> mesa-stable at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-stable
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-stable/attachments/20160618/87a46b5a/attachment-0001.sig>
More information about the mesa-stable
mailing list