[Nouveau] Some llvm questions (for tgsi backend)

Tue Jan 12 06:53:10 PST 2016

Hi Tom,

Thanks for taking the time to answer this.

On 11-01-16 18:10, Tom Stellard wrote:
> On Mon, Jan 11, 2016 at 12:07:14PM +0100, Hans de Goede wrote:
>> Hi,
>>
>> After a few distractions I'm back to work on the llvm tgsi backend. I've
>> added clang integration and I can now compile a simple opencl program
>> to something which sort of looks like tgsi.
>>
>> You can find my latest work on this here:
>> http://cgit.freedesktop.org/~jwrdegoede/llvm
>> http://cgit.freedesktop.org/~jwrdegoede/clang
>> (the latter may still need to sync)
>>
>> I've a little test program of which I have 3 versions now,
>> 1 raw gallium calls + a tgsi kernel
>> 2 opencl calls to clover + a tgsi kernel
>> 3 opencl calls to clover + an opencl kernel
>>
>> 1 and 2 have been tested on a kepler card, 3 has been
>> tested with pocl. My goal for this week is to get
>> the tgsi backend to produce code which I can copy
>> and paste into 2 and then have it working on a kepler card.
>>
>> The test program looks like this:
>>
>> __kernel void test_kern(__global uint *vals, __global uint *buf)
>> {
>>    uint id = get_global_id(0);
>>
>>    buf[32 * id] -= vals[id];
>> }
>>
>> The llvm ir looks like this:
>>
>> bin/clang -x cl -c -emit-llvm -target tgsi-- -include /usr/share/pocl/include/_kernel.h -o ~/foo.ir -x cl -S ~/foo.cl
>>
>> ; ModuleID = '/home/hans/foo.cl'
>> target datalayout = "E-p:32:32-i64:64:64-f32:32:32-n32"
>> target triple = "tgsi--"
>>
>> ; Function Attrs: nounwind
>> define void @test_kern(i32 addrspace(1)* nocapture readonly %vals, i32 addrspace(1)* nocapture %buf) #0 {
>> entry:
>>    %call = tail call i32 @_Z13get_global_idj(i32 0) #2
>>    %arrayidx = getelementptr inbounds i32, i32 addrspace(1)* %vals, i32 %call
>>    %0 = load i32, i32 addrspace(1)* %arrayidx, align 4, !tbaa !7
>>    %mul = shl i32 %call, 5
>>    %arrayidx1 = getelementptr inbounds i32, i32 addrspace(1)* %buf, i32 %mul
>>    %1 = load i32, i32 addrspace(1)* %arrayidx1, align 4, !tbaa !7
>>    %sub = sub i32 %1, %0
>>    store i32 %sub, i32 addrspace(1)* %arrayidx1, align 4, !tbaa !7
>>    ret void
>> }
>>
>> declare i32 @_Z13get_global_idj(i32) #1
>>
>> attributes #0 = { nounwind "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
>> attributes #1 = { "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
>> attributes #2 = { nounwind }
>>
>> !opencl.kernels = !{!0}
>> !llvm.ident = !{!6}
>>
>> !0 = !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @test_kern, !1, !2, !3, !4, !5}
>> !1 = !{!"kernel_arg_addr_space", i32 1, i32 1}
>> !2 = !{!"kernel_arg_access_qual", !"none", !"none"}
>> !3 = !{!"kernel_arg_type", !"uint*", !"uint*"}
>> !4 = !{!"kernel_arg_base_type", !"uint*", !"uint*"}
>> !5 = !{!"kernel_arg_type_qual", !"", !""}
>> !6 = !{!"clang version 3.8.0 (http://llvm.org/git/clang.git 9376f992e00569bd08a4ecf3a1d06d8b93c97681) (http://llvm.org/git/llvm.git 7a311143550c6fc01aa5000049825ecc09787440)"}
>> !7 = !{!8, !8, i64 0}
>> !8 = !{!"int", !9, i64 0}
>> !9 = !{!"omnipotent char", !10, i64 0}
>> !10 = !{!"Simple C/C++ TBAA"}
>>
>> And the "tgsi" looks like this:
>>
>>          .text
>>          .file   "/home/hans/foo.cl"
>>          .globl  test_kern
>> test_kern:
>>          BGNSUB
>>          MOVis TEMP1x, 0
>>          CAL _Z13get_global_idj
>>          SHLs TEMP1y, TEMP1x, 7
>>          LOADiis TEMP1z, [4]
>>          UADDs TEMP1y, TEMP1z, TEMP1y
>>          SHLs TEMP1x, TEMP1x, 2
>>          LOADiis TEMP1z, [0]
>>          UADDs TEMP1x, TEMP1z, TEMP1x
>>          LOADgis TEMP1x, [TEMP1x]
>>          INEGs TEMP1x, TEMP1x
>>          LOADgis TEMP1z, [TEMP1y]
>>          UADDs TEMP1x, TEMP1x, TEMP1z
>>          STOREgis [TEMP1y], TEMP1x
>>          RET
>>          ENDSUB
>>
>> Working tgsi for this would look like this:
>>
>> COMP
>> DCL SV[0], THREAD_ID[0]
>> DCL TEMP[0], LOCAL
>> DCL TEMP[1], LOCAL
>> IMM UINT32 { 0, 0, 0, 0 }
>> IMM UINT32 { 4, 0, 0, 0 }
>> IMM UINT32 { 128, 0, 0, 0 }
>>
>> BGNSUB
>>         LOAD TEMP[0].xy, RINPUT, IMM[0]
>>         UMUL TEMP[1].x, SV[0], IMM[1]
>>         UADD TEMP[0].x, TEMP[0], TEMP[1]
>>         UMUL TEMP[1].x, SV[0], IMM[2]
>>         UADD TEMP[0].y, TEMP[0], TEMP[1].xxxx
>>         LOAD TEMP[1].x, RGLOBAL, TEMP[0]
>>         LOAD TEMP[0].x, RGLOBAL, TEMP[0].yyyy
>>         UADD TEMP[1].x, TEMP[0], -TEMP[1]
>>         STORE RGLOBAL.x, TEMP[0].yyyy, TEMP[1]
>>         RET
>> ENDSUB;
>>
>> So my questions (I'm still quite green when it comes to llvm):
>>
>> 1) As you can see a proper tgsi program needs a header
>> to declare which registers (etc) it is using, in which
>> class-method should I implement this ?
>>
>
> These kinds of things should be emitted by the TGSI implementation of the
> AsmPrinter class: http://llvm.org/docs/doxygen/html/classllvm_1_1AsmPrinter.html
> You probably want to use EmitStartOfAsmFile() for the headers.

Ok, is there an easy way to suppress the generation of:

         .text
         .file   "/home/hans/foo.cl"
         .globl  test_kern

?

It seems I need to tackle all 3 of those separately ?

>> 2) Immediates need to be declared with a specific
>> value and then addressed as IMM[x], how would I go about
>> this ?
>>
>
> I would recommend adding a pass that replaced immediates in the code with
> IMM file regitser accesses.

Thinking more about this, I believe I can use the existing const
mechanism for this (and then add a const generation pass to AsmPrinter).

How do I tell llvm to never generate instructions using immediates
and to instead always use the const address space for this ?

>> 3) The get_global_id call needs to be translated into
>> simply using the SV[0] "register", how would I go about
>> this ?
>>
>
> get_global_id should be implemented in libclc with tgsi specific intrinsics
> that read from the system value registers.

I see I will investigate this further.

>> 4) The global and input load / stores are not handled
>> correctly, I see that the LOAD instructions get postfixed
>> with a i reps. g for input / global how would I go about
>> modifying the code emitter (AsmPrinter?) to change "LOADi"
>> into "LOAD <dest> RINPUT <offset>"?
>>
>
> I don't understand exactly why you want to change this.  Are you trying to
> make the assembly string for input loads and global loads look the same?

Currently the following code is generated:

LOADiis TEMP[1].z, [0]

This should be:

LOAD TEMP[1].z, RINPUT, 0

So I need to add handling to generate the RINPUT here. I'm wondering
what the best place is for this. Do I simply intercept the LOAD in
LowerMachineInstrToMCInst and add an extra operand there ?

>> 5) Talking about the lowecase suffixes to the instructions,
>> these should not be part of the output, how do I filter these?
>>
>> 6) And finally, the current llvm-tgsi output uses e.g.
>> TEMP1y where as for the destination it should use TEMP[1].y
>> and for the sources it should use TEMP[1].xxxx (so include
>> proper swizzling info).
>>
>
> You just need to change how the registers are printed in the
> TGSIRegisterInfo.td file which should fix this.

OK, so I've managed (easy) to change things so that now I get
(i.e.) :

UADDs TEMP[1].x, TEMP[1].x, TEMP[1].z

This is however not correct TGSI, correct would be:

UADD TEMP[1].x, TEMP[1].xxxx, TEMP[1].zzzz

Which may be shortend to:

UADD TEMP[1].x, TEMP[1], TEMP[1].zzzz

TGSI will automatically use the dest vector component from
the sources, and if you want to use another component you
need to specify a swizzle order, which is the 4 components
in an alternate order.

I'm not quite sure yet how to deal with this.

Regards,

Hans