[Mesa-dev] [PATCH 3/3] gallivm: add integer and unsigned mod arit functions.
Jose Fonseca
jfonseca at vmware.com
Tue Feb 28 01:42:04 PST 2012
----- Original Message -----
> On 27.02.2012 21:26, Jose Fonseca wrote:
> >
> >
> > ----- Original Message -----
> >> On Mon, Feb 20, 2012 at 01:50:43PM -0800, Jose Fonseca wrote:
> >>>
> >>>
> >>> ----- Original Message -----
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>>> On Sat, Feb 18, 2012 at 4:20 AM, Jose Fonseca
> >>>>> <jfonseca at vmware.com>
> >>>>> wrote:
> >>>>>> ----- Original Message -----
> >>>>>>> On Fri, Feb 17, 2012 at 9:46 PM, Jose Fonseca
> >>>>>>> <jfonseca at vmware.com>
> >>>>>>> wrote:
> >>>>>>>> Dave,
> >>>>>>>>
> >>>>>>>> Ideally there should be only one lp_build_mod() which will
> >>>>>>>> invoke LLVMBuildSRem or LLVMBuildURem depending on the value
> >>>>>>>> of bld->type.sign. The point being that this allows the same
> >>>>>>>> code generation logic to seamlessly target any type, without
> >>>>>>>> having to worry too much about which type it is targeting.
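> >>>>>>>>
> >>>>>>>> I.e., roughly this (untested sketch, assuming the usual
> >>>>>>>> lp_build_context fields):
> >>>>>>>>
> >>>>>>>> /* sketch: dispatch on the type's signedness */
> >>>>>>>> LLVMValueRef
> >>>>>>>> lp_build_mod(struct lp_build_context *bld,
> >>>>>>>>              LLVMValueRef a, LLVMValueRef b)
> >>>>>>>> {
> >>>>>>>>    LLVMBuilderRef builder = bld->gallivm->builder;
> >>>>>>>>
> >>>>>>>>    /* remainder only makes sense for integer types here */
> >>>>>>>>    assert(!bld->type.floating);
> >>>>>>>>
> >>>>>>>>    if (bld->type.sign)
> >>>>>>>>       return LLVMBuildSRem(builder, a, b, "");
> >>>>>>>>    else
> >>>>>>>>       return LLVMBuildURem(builder, a, b, "");
> >>>>>>>> }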
> >>>>>>>
> >>>>>>> Yeah, I agree with this for now, but I'm starting to think a
> >>>>>>> lot of this stuff is redundant once I looked at what Tom has
> >>>>>>> done.
> >>>>>>>
> >>>>>>> The thing is, TGSI doesn't have that many crazy options where
> >>>>>>> you are going to be targeting instructions at the wrong type,
> >>>>>>> and wrapping all the basic llvm interfaces with an extra type
> >>>>>>> layer seems to me, long term, like a waste of time.
> >>>>>>
> >>>>>> So far llvmpipe's TGSI -> LLVM IR translation has only been
> >>>>>> targeting floating point SIMD instructions.
> >>>>>>
> >>>>>> But the truth is that many simple fragment shaders can be
> >>>>>> partially done with 8-bit and 16-bit SIMD integers, if values
> >>>>>> are represented as 8-bit and 16-bit unorms. The throughput for
> >>>>>> these will be much higher: not only can we squeeze in more
> >>>>>> elements, they also take fewer cycles, and the hardware has
> >>>>>> several arithmetic units.
> >>>>>>
> >>>>>> The point of those lp_build_xxx functions is to handle this
> >>>>>> transparently. See, e.g., how lp_build_mul handles fixed point.
> >>>>>> Currently this is only used for blending, but the hope is to
> >>>>>> eventually use it in the TGSI translation of simple fragment
> >>>>>> shaders.
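> >>>>>>
> >>>>>> For unorm8 multiplication, e.g., what it emits boils down to
> >>>>>> the usual rounding trick (simplified scalar sketch of the
> >>>>>> idea, not the actual vectorized code):
> >>>>>>
> >>>>>> /* sketch: 8-bit unorm multiply, computes round(a * b / 255) */
> >>>>>> static inline uint8_t
> >>>>>> unorm8_mul(uint8_t a, uint8_t b)
> >>>>>> {
> >>>>>>    unsigned t = (unsigned)a * b + 0x80;
> >>>>>>    return (uint8_t)((t + (t >> 8)) >> 8);
> >>>>>> }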
> >>>>>>
> >>>>>> Maybe not the case for desktop GPUs, but I also heard that
> >>>>>> some low-powered devices have shader engines with 8-bit
> >>>>>> unorms.
> >>>>>>
> >>>>>> But of course, not all opcodes can be done correctly this way,
> >>>>>> and URem/SRem might not be ones we care about.
> >>>>>>
> >>>>>>> I'm happy for now to finish the integer support in the same
> >>>>>>> style as the current code, but I think moving forward it
> >>>>>>> might be worth investigating a more direct instruction
> >>>>>>> emission scheme.
> >>>>>>
> >>>>>> If you wanna invoke LLVMBuildURem/LLVMBuildSRem directly from
> >>>>>> the TGSI translation I'm fine with it. We can always
> >>>>>> generalize later.
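> >>>>>>
> >>>>>> That is, something like this (untested sketch; the action name
> >>>>>> is just for illustration):
> >>>>>>
> >>>>>> /* TGSI_OPCODE_UMOD (sketch) */
> >>>>>> static void
> >>>>>> umod_emit(
> >>>>>>    const struct lp_build_tgsi_action * action,
> >>>>>>    struct lp_build_tgsi_context * bld_base,
> >>>>>>    struct lp_build_emit_data * emit_data)
> >>>>>> {
> >>>>>>    emit_data->output[emit_data->chan] =
> >>>>>>       LLVMBuildURem(bld_base->base.gallivm->builder,
> >>>>>>                     emit_data->args[0],
> >>>>>>                     emit_data->args[1], "");
> >>>>>> }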
> >>>>>>
> >>>>>>> Perhaps Tom can also comment from his experience.
> >>>>>>
> >>>>>> BTW, Tom, I just now noticed that there are two action
> >>>>>> versions for add:
> >>>>>>
> >>>>>> /* TGSI_OPCODE_ADD (CPU Only) */
> >>>>>> static void
> >>>>>> add_emit_cpu(
> >>>>>>    const struct lp_build_tgsi_action * action,
> >>>>>>    struct lp_build_tgsi_context * bld_base,
> >>>>>>    struct lp_build_emit_data * emit_data)
> >>>>>> {
> >>>>>>    emit_data->output[emit_data->chan] =
> >>>>>>       lp_build_add(&bld_base->base,
> >>>>>>                    emit_data->args[0],
> >>>>>>                    emit_data->args[1]);
> >>>>>> }
> >>>>>>
> >>>>>> /* TGSI_OPCODE_ADD */
> >>>>>> static void
> >>>>>> add_emit(
> >>>>>>    const struct lp_build_tgsi_action * action,
> >>>>>>    struct lp_build_tgsi_context * bld_base,
> >>>>>>    struct lp_build_emit_data * emit_data)
> >>>>>> {
> >>>>>>    emit_data->output[emit_data->chan] =
> >>>>>>       LLVMBuildFAdd(bld_base->base.gallivm->builder,
> >>>>>>                     emit_data->args[0],
> >>>>>>                     emit_data->args[1], "");
> >>>>>> }
> >>>>>>
> >>>>>> Why is this necessary? lp_build_add will already call
> >>>>>> LLVMBuildFAdd internally as appropriate.
> >>>>>>
> >>>>>> Is this because some of the functions in lp_bld_arit.c will
> >>>>>> emit x86 intrinsics? If so, then a "no-x86-intrinsic" flag in
> >>>>>> the build context would achieve the same effect with less code
> >>>>>> duplication.
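> >>>>>>
> >>>>>> I.e., something along these lines inside lp_build_add (the
> >>>>>> no_intrinsics flag name is hypothetical; untested sketch):
> >>>>>>
> >>>>>> static LLVMValueRef
> >>>>>> lp_build_add(struct lp_build_context *bld,
> >>>>>>              LLVMValueRef a, LLVMValueRef b)
> >>>>>> {
> >>>>>>    LLVMBuilderRef builder = bld->gallivm->builder;
> >>>>>>
> >>>>>>    if (!bld->no_intrinsics) {   /* hypothetical flag */
> >>>>>>       /* ... pick the x86 saturated-add etc. intrinsics,
> >>>>>>        * exactly as the current code does ... */
> >>>>>>    }
> >>>>>>
> >>>>>>    /* generic path, usable by CPU and GPU targets alike */
> >>>>>>    if (bld->type.floating)
> >>>>>>       return LLVMBuildFAdd(builder, a, b, "");
> >>>>>>    else
> >>>>>>       return LLVMBuildAdd(builder, a, b, "");
> >>>>>> }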
> >>>>>>
> >>>>>> If possible I'd prefer a single version of these actions. If
> >>>>>> not, then I'd prefer to have them split: lp_build_action_cpu.c
> >>>>>> and lp_build_action_gpu.c.
> >>>>>
> >>>>> Yes, this is why I split them up. I can add that flag and merge
> >>>>> the actions together.
> >>>>
> >>>> That would be nice. Thanks.
> >>>
> >>> Tom, actually I've been looking more at the code, thinking about
> >>> this, and I'm not so sure what's best anymore.
> >>>
> >>> I'd appreciate your honest answer: do you think the stuff in
> >>> lp_bld_arit.[ch] is of any use for GPUs in general (or AMD's in
> >>> particular), or is it just a hindrance?
> >>>
> >>> As I said before, for CPUs this abstraction is useful: it allows
> >>> converting TGSI (and other fixed function state) -> fixed point
> >>> SIMD instructions, which yield the highest throughput on CPUs,
> >>> because LLVM's native types are not expressive enough for fixed
> >>> function, etc.
> >>>
> >>> But if this is useless for GPUs (i.e., if LLVM's native types are
> >>> sufficient), then we can make this abstraction a CPU-only thing.
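> >>>
> >>> For reference, this is roughly what lp_type captures that plain
> >>> LLVM types can't (quoting from memory, so take the exact fields
> >>> with a grain of salt):
> >>>
> >>> /* from lp_bld_type.h, roughly */
> >>> struct lp_type {
> >>>    unsigned floating:1;  /* floating point values */
> >>>    unsigned fixed:1;     /* fixed point values */
> >>>    unsigned sign:1;      /* signed */
> >>>    unsigned norm:1;      /* values normalized to [0,1]/[-1,1] */
> >>>    unsigned width;       /* bit width of each element */
> >>>    unsigned length;      /* number of elements in the vector */
> >>> };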
> >>>
> >>
> >> I don't think the lp_bld_arit.c functions are really useful for
> >> GPUs, and I don't rely on any of them in the R600 backend. Also, I
> >> was looking through those functions again and the problem is more
> >> than just x86 intrinsics: some of them assume vector types, which
> >> I don't use at all.
> >
> > Does that mean that the R600 generates/consumes only scalar
> > expressions?
> R600 (HD2xxx) up to Evergreen/Northern Islands (HD6xxx except HD69xx)
> are VLIW5. So that's not exactly scalar, but it doesn't quite fit any
> simd vector model either (as you can have 5 different instructions
> per instruction slot). (Cayman, aka HD69xx, is VLIW4, and the new GCN
> chips, aka HD7xxx, indeed use a scalar model, as does nvidia.)
> The vectors as they are used by llvmpipe are of course there in gpus
> too, but these are really mostly hidden (amd chips generally have a
> logical simd width of 64 and nvidia 32 - amd calls this the wavefront
> size and nvidia the warp size, but in any case you still emit
> scalar-looking instructions which are really implicit vectors).
> So I guess using explicit vectors isn't really helping matters.
> Maybe with intel gpus it would fit better, as they have a sort of
> configurable simd width with more control. No idea though if it would
> actually be useful.
> Older chips certainly have some more (AoS) simd aspects to them, but
> the model doesn't quite fit either.
>
I see. Thanks for the explanation, Roland.
Jose