[Mesa-dev] [PATCH 3/3] gallivm: add integer and unsigned mod arit functions.

Roland Scheidegger sroland at vmware.com
Mon Feb 27 17:28:34 PST 2012


On 27.02.2012 21:26, Jose Fonseca wrote:
> 
> 
> ----- Original Message -----
>> On Mon, Feb 20, 2012 at 01:50:43PM -0800, Jose Fonseca wrote:
>>>
>>>
>>> ----- Original Message -----
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> On Sat, Feb 18, 2012 at 4:20 AM, Jose Fonseca
>>>>> <jfonseca at vmware.com>
>>>>> wrote:
>>>>>> ----- Original Message -----
>>>>>>> On Fri, Feb 17, 2012 at 9:46 PM, Jose Fonseca
>>>>>>> <jfonseca at vmware.com>
>>>>>>> wrote:
>>>>>>>> Dave,
>>>>>>>>
>>>>>>>> Ideally there should be only one lp_build_mod(), which invokes
>>>>>>>> LLVMBuildSRem or LLVMBuildURem depending on the value of
>>>>>>>> bld->type.sign.  The point is that this allows the same code
>>>>>>>> generation logic to seamlessly target any type without having
>>>>>>>> to worry too much about which type it is targeting.
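
(For reference, a minimal sketch of what such a combined helper could
look like - an illustration assuming the usual lp_build_context layout,
not the actual patch:)

/* Needs gallivm/lp_bld_type.h, llvm-c/Core.h and assert.h.
 * Sketch: pick the signed or unsigned remainder instruction based on
 * the type carried in the build context. */
LLVMValueRef
lp_build_mod(struct lp_build_context *bld,
             LLVMValueRef a,
             LLVMValueRef b)
{
   LLVMBuilderRef builder = bld->gallivm->builder;

   assert(!bld->type.floating);

   if (bld->type.sign)
      return LLVMBuildSRem(builder, a, b, "");
   else
      return LLVMBuildURem(builder, a, b, "");
}
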
>>>>>>>
>>>>>>> Yeah I agree with this for now, but I'm starting to think a
>>>>>>> lot of this stuff is redundant now that I've looked at what
>>>>>>> Tom has done.
>>>>>>>
>>>>>>> The thing is, TGSI doesn't have that many crazy options where
>>>>>>> you are going to be targeting instructions at the wrong type,
>>>>>>> and wrapping all the basic llvm interfaces with an extra type
>>>>>>> layer seems to me like a waste of time in the long term.
>>>>>>
>>>>>> So far llvmpipe's TGSI->LLVM IR translation has only been
>>>>>> targeting floating point SIMD instructions.
>>>>>>
>>>>>> But the truth is that many simple fragment shaders can be
>>>>>> partially done with 8-bit and 16-bit SIMD integers, if values
>>>>>> are represented as 8-bit and 16-bit unorms.  The throughput for
>>>>>> these will be much higher: not only can we squeeze in more
>>>>>> elements, they also take fewer cycles, and the hardware has
>>>>>> several arithmetic units.
>>>>>>
>>>>>> The point of those lp_build_xxx functions is to handle this
>>>>>> transparently.  See, e.g., how lp_build_mul handles fixed
>>>>>> point.  Currently this is only used for blending, but the hope
>>>>>> is to eventually use it in the TGSI translation of simple
>>>>>> fragment shaders.
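
(For reference, this is the kind of detail the abstraction hides - the
classic exact 8-bit unorm multiply, written here as scalar C for
clarity; lp_build_mul emits roughly the vectorized equivalent when the
lp_type describes unorm8 values:)

#include <stdint.h>

/* Exact (a * b) / 255 with correct rounding, using only a multiply,
 * an add and two shifts - the standard unorm8 trick. */
static inline uint8_t
mul_unorm8(uint8_t a, uint8_t b)
{
   uint16_t t = (uint16_t)a * b + 0x80;
   return (uint8_t)((t + (t >> 8)) >> 8);
}
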
>>>>>>
>>>>>> Maybe that's not the case for desktop GPUs, but I also heard
>>>>>> that some low-powered devices have shader engines with 8-bit
>>>>>> unorms.
>>>>>>
>>>>>> But of course, not all opcodes can be done correctly, and
>>>>>> URem/SRem might not be ones we care about.
>>>>>>
>>>>>>> I'm happy to finish the integer support in the same style as
>>>>>>> the current code for now, but moving forward I think it might
>>>>>>> be worth investigating a more direct instruction emission
>>>>>>> scheme.
>>>>>>
>>>>>> If you want to invoke LLVMBuildURem/LLVMBuildSRem directly from
>>>>>> the tgsi translation, I'm fine with it.  We can always
>>>>>> generalize later.
>>>>>>
>>>>>>> Perhaps Tom can also comment from his experience.
>>>>>>
>>>>>> BTW, Tom, I just now noticed that there are two action versions
>>>>>> for add:
>>>>>>
>>>>>> /* TGSI_OPCODE_ADD (CPU Only) */
>>>>>> static void
>>>>>> add_emit_cpu(
>>>>>>    const struct lp_build_tgsi_action * action,
>>>>>>    struct lp_build_tgsi_context * bld_base,
>>>>>>    struct lp_build_emit_data * emit_data)
>>>>>> {
>>>>>>    emit_data->output[emit_data->chan] =
>>>>>>       lp_build_add(&bld_base->base,
>>>>>>                    emit_data->args[0],
>>>>>>                    emit_data->args[1]);
>>>>>> }
>>>>>>
>>>>>> /* TGSI_OPCODE_ADD */
>>>>>> static void
>>>>>> add_emit(
>>>>>>    const struct lp_build_tgsi_action * action,
>>>>>>    struct lp_build_tgsi_context * bld_base,
>>>>>>    struct lp_build_emit_data * emit_data)
>>>>>> {
>>>>>>    emit_data->output[emit_data->chan] =
>>>>>>       LLVMBuildFAdd(bld_base->base.gallivm->builder,
>>>>>>                     emit_data->args[0],
>>>>>>                     emit_data->args[1], "");
>>>>>> }
>>>>>>
>>>>>> Why is this necessary?  lp_build_add will already call
>>>>>> LLVMBuildFAdd internally as appropriate.
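
(For context, the type dispatch inside lp_build_add boils down to
something like the following simplified sketch - the name
lp_build_add_sketch is made up, and the real helper additionally
handles saturated norm types and may use x86 intrinsics for them:)

/* Needs gallivm/lp_bld_type.h and llvm-c/Core.h. */
LLVMValueRef
lp_build_add_sketch(struct lp_build_context *bld,
                    LLVMValueRef a,
                    LLVMValueRef b)
{
   LLVMBuilderRef builder = bld->gallivm->builder;

   if (bld->type.floating)
      return LLVMBuildFAdd(builder, a, b, "");
   else
      return LLVMBuildAdd(builder, a, b, "");
}
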
>>>>>>
>>>>>> Is this because some of the functions in lp_bld_arit.c will
>>>>>> emit x86 intrinsics?  If so, then a "no-x86-intrinsic" flag in
>>>>>> the build context would achieve the same effect with less code
>>>>>> duplication.
>>>>>>
>>>>>> If possible I'd prefer a single version of these actions.  If
>>>>>> not, then I'd prefer to have them split: lp_build_action_cpu.c
>>>>>> and lp_build_action_gpu.c.
>>>>>
>>>>> Yes, this is why I split them up.  I can add that flag and merge
>>>>> the actions together.
>>>>
>>>> That would be nice. Thanks.
>>>
>>> Tom, actually I've been looking more at the code and thinking
>>> about this, and I'm not so sure what's best anymore.
>>>
>>> I'd appreciate your honest answer: do you think the stuff in
>>> lp_bld_arit.[ch] is of any use for GPUs in general (or AMD's in
>>> particular), or is it just a hindrance?
>>>
>>> As I said before, this abstraction is useful for CPUs: it allows
>>> converting TGSI (and other fixed function state) to the fixed
>>> point SIMD instructions which yield the highest throughput on
>>> CPUs, because LLVM's native types are not expressive enough for
>>> fixed function, etc.
>>>
>>> But if this is useless for GPUs (i.e., if LLVM's native types are
>>> sufficient), then we can make this abstraction a CPU-only thing.
>>>
>>
>> I don't think the lp_bld_arit.c functions are really useful for GPUs,
>> and I don't rely on any of them in the R600 backend.  Also, I was
>> looking
>> through those functions again and the problem is more than just x86
>> intrinsics.  Some of them assume vector types, which I don't use at
>> all.
> 
> Does that mean that the R600 generates/consumes only scalar expressions?
R600 (HD2xxx) up to Evergreen/Northern Islands (HD6xxx except HD69xx)
are VLIW5. So that's not exactly scalar, but it doesn't quite fit any
simd vector model either (as you can have 5 different instructions per
instruction slot). (Cayman, aka HD69xx, is VLIW4, and the new GCN
chips, aka HD7xxx, indeed use a scalar model, as does nvidia.)
The vectors as llvmpipe uses them are of course there in gpus too, but
they are really mostly hidden (amd chips generally have a logical simd
width of 64 and nvidia of 32 - amd calls this the wavefront size and
nvidia the warp size - but in any case you still emit scalar-looking
instructions which are really implicit vectors).
So I guess using explicit vectors isn't really helping matters.
Maybe it would fit better with intel gpus, as they have a sort of
configurable simd width with more control. No idea, though, whether it
would actually be useful.
Older chips certainly have some more (AoS) simd aspects to them, but
the model doesn't quite fit either.

Roland


> 
>> So, maybe it is best to keep them separate.
> 
> Yes, it seems so.
> 
> 
> Does anybody else working on or planning to write a TGSI -> LLVM translation pass have any other thoughts?
> 
> 
> If not, I'll eventually split the helpers into two kinds (sketched below):
> - generic helpers that operate directly on LLVMTypeRef and LLVMValueRef
> - a native SIMD abstraction that operates on lp_build_type, for fixed point, etc.
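
(Hypothetical signatures illustrating the two kinds - the first name is
made up for illustration:)

/* 1) Generic helper: operates directly on LLVM values, with no
 *    gallivm type metadata involved. */
LLVMValueRef
lp_build_fadd_raw(LLVMBuilderRef builder,
                  LLVMValueRef a, LLVMValueRef b);

/* 2) SIMD abstraction: consults the lp_type in the build context
 *    (floating vs. fixed point, sign, norm) to pick instructions. */
LLVMValueRef
lp_build_add(struct lp_build_context *bld,
             LLVMValueRef a, LLVMValueRef b);
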
> 
> 
> For the record, this is what you seem to use so far:
> 
> $ git grep '#include.*gallivm' src/gallium/drivers/r600/
> src/gallium/drivers/r600/r600_llvm.c:#include "gallivm/lp_bld_const.h"
> src/gallium/drivers/r600/r600_llvm.c:#include "gallivm/lp_bld_intr.h"
> src/gallium/drivers/r600/r600_llvm.c:#include "gallivm/lp_bld_gather.h"
> src/gallium/drivers/r600/r600_llvm.h:#include "gallivm/lp_bld_tgsi.h"
> 
> 
> 
> Jose


