[Mesa-dev] [PATCH 3/3] gallivm: add integer and unsigned mod arit functions.

Wed Feb 22 10:49:44 PST 2012

On Mon, Feb 20, 2012 at 01:50:43PM -0800, Jose Fonseca wrote:
> 
> 
> ----- Original Message -----
> > 
> > 
> > ----- Original Message -----
> > > On Sat, Feb 18, 2012 at 4:20 AM, Jose Fonseca <jfonseca at vmware.com>
> > > wrote:
> > > > ----- Original Message -----
> > > >> On Fri, Feb 17, 2012 at 9:46 PM, Jose Fonseca
> > > >> <jfonseca at vmware.com>
> > > >> wrote:
> > > >> > Dave,
> > > >> >
> > > >> > Ideally there should be only one lp_build_mod() which will
> > > >> > invoke
> > > >> > LLVMBuildSRem or LLVMBuildURem depending on the value of
> > > >> > bld->type.sign.  The point being that this allows the same
> > > >> > code
> > > >> > generation logic to seemingly target any type without having
> > > >> > to
> > > >> > worry too much which target it is targeting.
> > > >>
> > > >> Yeah I agree with this for now, but I'm starting to think a lot
> > > >> of
> > > >> this stuff is redunant once I looked at what Tom has done.
> > > >>
> > > >> The thing is TGSI doesn't have that many crazy options where you
> > > >> are
> > > >> going to be targetting instructions at the wrong type, and
> > > >> wrapping
> > > >> all the basic llvm interfaces with an extra type layer seems to
> > > >> me
> > > >> long term like a waste of time.
> > > >
> > > > So far llvmpipe's TGSI->LLVM IR has only been targetting floating
> > > > point SIMD instructions.
> > > >
> > > > But truth is that many simple fragment shaders can be partially
> > > > done with 8bit and 16bit SIMD integers, if values are represented
> > > > in 8bit unorm and 16 bit unorms. The throughput for these will be
> > > > much higher, as not only we can squeeze more elements, they take
> > > > less cycles, and the hardware has several arithmetic units.
> > > >
> > > > The point of those lp_build_xxx functions is to handle this
> > > > transparently. See, e.g., how lp_build_mul handles fixed point.
> > > > Currently this is only used for blending, but the hope is to
> > > > eventually use it on TGSI translation of simple fragment shaders.
> > > >
> > > > Maybe not the case for the desktop GPUs, but I also heard that
> > > > some
> > > > low powered devices have shader engines w/ 8bit unorms.
> > > >
> > > > But of course, not all opcodes can be done correctly: and
> > > > URem/SRem
> > > > might not be one we care.
> > > >
> > > >> I'm happy for now to finish the integer support in the same
> > > >> style
> > > >> as
> > > >> the current code, but I think moving forward afterwards it might
> > > >> be
> > > >> worth investigating a more direct instruction emission scheme.
> > > >
> > > > If you wanna invoke LLVMBuildURem/LLVMBuildSRem directly from
> > > > tgsi
> > > > translation I'm fine with it. We can always generalize
> > > >
> > > >> Perhaps
> > > >> Tom can comment also from his experience.
> > > >
> > > > BTW, Tom, I just now noticed that there are two action versions
> > > > for
> > > > add:
> > > >
> > > > /* TGSI_OPCODE_ADD (CPU Only) */
> > > > static void
> > > > add_emit_cpu(
> > > >   const struct lp_build_tgsi_action * action,
> > > >   struct lp_build_tgsi_context * bld_base,
> > > >   struct lp_build_emit_data * emit_data)
> > > > {
> > > >   emit_data->output[emit_data->chan] =
> > > >   lp_build_add(&bld_base->base,
> > > >                                   emit_data->args[0],
> > > >                                   emit_data->args[1]);
> > > > }
> > > >
> > > > /* TGSI_OPCODE_ADD */
> > > > static void
> > > > add_emit(
> > > >   const struct lp_build_tgsi_action * action,
> > > >   struct lp_build_tgsi_context * bld_base,
> > > >   struct lp_build_emit_data * emit_data)
> > > > {
> > > >   emit_data->output[emit_data->chan] = LLVMBuildFAdd(
> > > >                                bld_base->base.gallivm->builder,
> > > >                                emit_data->args[0],
> > > >                                emit_data->args[1], "");
> > > > }
> > > >
> > > > Why is this necessary? lp_build_add will already call
> > > > LLVMBuildFAdd
> > > > internally as appropriate.
> > > >
> > > > Is this because some of the functions in lp_bld_arit.c will emit
> > > > x86 intrinsics? If so then a "no-x86-intrinsic" flag in the build
> > > > context would achieve the same effect with less code duplication.
> > > >
> > > > If possible I'd prefer a single version of these actions. If not,
> > > > then I'd prefer have them split: lp_build_action_cpu.c and
> > > > lp_build_action_gpu.
> > > 
> > > Yes, this is why a split them up.  I can add that flag and merge
> > > the
> > > actions together.
> > 
> > That would be nice. Thanks.
> 
> Tom, actually I've been looking more at the code, thinking about this, and I'm not so sure what's best anymore.
> 
> I'd appreciate your honest answer: do you think the stuff in lp_bld_arit.[ch] of any use for GPUs in general (or AMD's in particular), or is it just an hinderance?
> 
> As I said before, for CPUs, this abstraction is useful, to allow to convert TGSI (and other fixed function state) -> fixed point SIMD instructions, which yield the highest throughput on CPUs. Because LLVM native types are not expressive enough for fixed function, etc.
> 
> But if this is useless for GPUs (i.e, if LLVM's native types are sufficient), then we can make this abstraction a CPU only thing.
> 

I don't think the lp_bld_arit.c functions are really useful for GPUs,
and I don't rely on any of them in the R600 backend.  Also, I was looking
through those functions again and the problem is more than just x86
intrinsics.  Some of them assume vector types, which I don't use at all.
So, maybe it is best to keep them separate.

-Tom

> Jose
>