[Mesa-dev] [PATCH 2/3] gallivm: add fp64 support.

Mon Jun 29 16:36:49 PDT 2015

On 29.06.2015 at 22:18, Dave Airlie wrote:
> On 30 June 2015 at 00:58, Roland Scheidegger <sroland at vmware.com> wrote:
>> Don't worry about the AoS stuff. Only meant to do simple things.
>>
>> Looks good overall. I guess it makes sense not to split execution
>> either (so you'd have native hw vector size there); llvm should handle
>> that pretty well these days. The sse intrinsics probably won't get used
>> that way (there's a helper which makes it possible, but it might not be
>> hooked up), but I guess there's not really much need for them.
>>
>> Some comments inline.
> 
> I've noticed we have no tests for indirect access to fp64 things, so
> I'll probably write some first to validate the indirect paths I
> haven't fixed up yet.
Ok, thanks for looking at that.

> 
>>> Two things that don't mix well are SoA and doubles, see
>>> emit_fetch_double, and emit_store_double_chan in this.
>>>
>>> I've also had to split emit_data.chan, to add src_chan,
>>> which can be different for doubles.
>>>
>>> Open issues:
>>> are intrinsics okay for floor/ceil?
>> The question is if they actually work if you don't have sse4.1 and don't
>> just crash (at least I assume with sse4.1 it turns into round
>> instruction). (Or on non-x86 cpus if there is no direct hw support). If
>> they don't you'd have to provide your own implementation (at least as a
>> fallback) or make support for the extension conditional. Otherwise llvm
>> intrinsics are just fine (traditionally we didn't really use them much
>> as most of the things we do with sse intrinsics were missing, and even
>> if some intrinsic existed it often didn't work, but that was a long time
>> ago - ideally we'd switch to llvm intrinsics where possible).
> 
> Okay, well, I'm okay with limiting fp64 to where they work, I suppose,
> though that needs testing on older non-sse4.1 hw.
It ought to be possible to tell llvm not to use some features by telling
it which cpu to target. We currently never do that though.
You can also run under the Intel Software Development Emulator (SDE),
though it's incredibly slow. Ideally, these intrinsics would work
regardless of cpu features of course, but I'm not so sure about it...
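
For reference, a fallback of the kind mentioned above could look roughly
like the scalar C sketch below. It is only an illustration, not code from
the patch or from gallivm: the names fallback_floor and fallback_ceil are
made up, it works on one double at a time rather than on a vector, and it
does not preserve the sign of negative zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar fallback for floor() on doubles. It uses only a
 * float->int conversion and a compare, so it does not depend on an
 * SSE4.1 round instruction. The sign of -0.0 is not preserved. */
static double fallback_floor(double x)
{
    /* Doubles with |x| >= 2^52 are already integral. NaN fails both
     * comparisons and is returned unchanged as well. */
    if (!(x > -4503599627370496.0 && x < 4503599627370496.0))
        return x;

    /* Truncate toward zero, then step down by one if truncation
     * rounded up (which only happens for negative non-integers). */
    double t = (double)(int64_t)x;
    if (t > x)
        t -= 1.0;

    assert(t <= x);
    return t;
}

/* ceil(x) == -floor(-x), so the same fallback covers both. */
static double fallback_ceil(double x)
{
    return -fallback_floor(-x);
}
```

With this, fallback_floor(-1.5) gives -2.0 and fallback_ceil(1.25) gives
2.0. A vectorized version would do the same steps with a select instead
of the branch.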


> 
>>> +
>>> +      scalar = LLVMBuildExtractElement(builder, input, si, "");
>>> +      res = LLVMBuildInsertElement(builder, res, scalar, ii, "");
>>> +      scalar2 = LLVMBuildExtractElement(builder, input2, si, "");
>>> +      res = LLVMBuildInsertElement(builder, res, scalar2, ii1, "");
>>> +   }
>> Did you check what code this generated? Traditionally, we tried to avoid
>> the extract/insert stuff where possible and use shuffles instead,
>> because llvm would actually emit inserts/extracts (i.e. move from simd
>> domain to integer domain and back, which is pretty horrendous, and
>> doubly so on some non-intel cpus which have like 15+ cycles latency for
>> this). It is possible though that this is no longer a problem: llvm 3.6
>> or 3.7 got a much improved shuffle optimizer which might also catch this.
> 
> No, I haven't looked at what it generated. I was pretty sure it was
> going to be ugly.
> 
> Oh, if I can use shufflevector for this direction I probably will, that
> makes sense. I'm not sure it'll work for the other way,
> but maybe two shufflevectors will; I hadn't looked into it that much yet.
> 
There are lots of shuffle helpers (build_interleave and such). We needed
to be careful what shuffles we used as llvm didn't have a good handle on
them (so we actually used multiple shuffles in series to mirror what sse
could do, instead of using a single one). However, with the new shuffle
optimizer for x86 (which landed in 3.6 I think) even that should not be
necessary anymore - you can just construct any shuffle you want and llvm
should lower it pretty much optimally (and even if that's like 4
shuffles, that's still an order of magnitude better than a dozen or more
inserts/extracts). But like I said, it is possible it even handles
inserts/extracts in some more useful way nowadays (though I'd guess it
would be easier for the optimization passes if you started with a
single shuffle instead of tons of inserts/extracts even in this case) -
a quick look at the assembly for any example should tell you that.
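
To make the single-shuffle idea concrete, here is a small standalone C
model of the mask such a shuffle would need. It assumes the quoted loop
interleaves the two inputs element by element (input[i] to slot 2*i,
input2[i] to slot 2*i+1), which I'm only guessing from the hunk.
build_interleave_mask and apply_shuffle are hypothetical names for this
sketch, not gallivm helpers, and apply_shuffle is just a reference model
of shufflevector-style indexing (0..n-1 picks from the first input,
n..2n-1 from the second).

```c
#include <assert.h>

/* Fill mask[0 .. 2n-1] so that a single two-input shuffle produces
 * a0 b0 a1 b1 ... from inputs a and b of n elements each. */
static void build_interleave_mask(unsigned n, unsigned *mask)
{
    for (unsigned i = 0; i < n; i++) {
        mask[2 * i]     = i;      /* element i of the first input  */
        mask[2 * i + 1] = n + i;  /* element i of the second input */
    }
}

/* Reference model of a two-input shuffle on doubles: each mask entry
 * indexes into the concatenation of a and b. */
static void apply_shuffle(const double *a, const double *b, unsigned n,
                          const unsigned *mask, unsigned mask_len,
                          double *out)
{
    for (unsigned i = 0; i < mask_len; i++) {
        assert(mask[i] < 2 * n);
        out[i] = mask[i] < n ? a[mask[i]] : b[mask[i] - n];
    }
}
```

For n = 4 the mask comes out as 0 4 1 5 2 6 3 7, i.e. one shuffle in place
of eight extract/insert pairs. The same indices could be handed to llvm as
a constant vector for one shufflevector.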

Roland