[Mesa-dev] software implementation of vulkan for gsoc/evoc

Jose Fonseca jfonseca at vmware.com
Sat Jun 10 22:24:55 UTC 2017


I know this is an old thread.  I completely missed it the first time, 
but recently rediscovered it after reading 
http://www.phoronix.com/scan.php?page=news_item&px=Vulkan-CPU-Repository , 
and perhaps it's not too late for a couple of comments, FWIW.

On 13/02/17 02:17, Jacob Lifshay wrote:
> forgot to add mesa-dev when I sent.
> ---------- Forwarded message ----------
> From: "Jacob Lifshay" <programmerjake at gmail.com 
> <mailto:programmerjake at gmail.com>>
> Date: Feb 12, 2017 6:16 PM
> Subject: Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc
> To: "Dave Airlie" <airlied at gmail.com <mailto:airlied at gmail.com>>
> Cc:
> 
> 
> 
> On Feb 12, 2017 5:34 PM, "Dave Airlie" <airlied at gmail.com> wrote:
> 
> > > I'm assuming that control barriers in Vulkan are identical to
> > > barriers across a work-group in OpenCL. I was going to have a
> > > work-group be a single OS thread, with the different work-items
> > > mapped to SIMD lanes. If we need to have additional scheduling, I
> > > have written a JavaScript compiler that supports generator
> > > functions, so I mostly know how to write an LLVM pass to
> > > implement that. I was planning on writing the shader compiler
> > > using LLVM, using the whole-function-vectorization pass I will
> > > write, and using the pre-existing SPIR-V to LLVM translation
> > > layer. I would also write some LLVM passes to translate texture
> > > reads and the like into basic vector ops.
> 
> > Well, the problem is that the number of work-groups launched could
> > be quite high, and this can cause large overhead in the number of
> > host threads that have to be launched. There was some discussion of
> > this in the mesa-dev archives back when I added softpipe compute
> > shaders.
> 
> 
> I would start a thread for each CPU, then have each thread run the 
> compute shader a number of times, instead of having a thread per 
> shader invocation.

At least for llvmpipe, last time I looked into this, green threads 
seemed like a simple, non-intrusive method of dealing with this --
https://lists.freedesktop.org/archives/mesa-dev/2016-April/114790.html
-- but it sounds like LLVM coroutines could handle this more effectively.
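
For what it's worth, here's a minimal sketch of the thread-per-core 
scheme (all names hypothetical, not llvmpipe or Mesa code): a fixed 
pool of workers drains an atomic work-group counter, so the dispatch 
size never dictates the OS thread count.  The catch is barriers -- a 
work-group containing one can't just run start to finish on one lane 
set, which is exactly where green threads or LLVM coroutines come in, 
suspending at the barrier and resuming once every invocation in the 
group has reached it.

#include <atomic>
#include <thread>
#include <vector>

struct DispatchState {
    unsigned group_count;                   /* total work-groups */
    std::atomic<unsigned> next_group;       /* next group to claim */
    void (*jit_shader)(unsigned group_id);  /* vectorized compute shader */
};

/* Each worker claims work-groups until the dispatch is drained, so the
 * OS thread count stays bounded by the core count no matter how many
 * groups were launched. */
static void worker(DispatchState *d)
{
    for (;;) {
        unsigned g = d->next_group.fetch_add(1, std::memory_order_relaxed);
        if (g >= d->group_count)
            return;
        d->jit_shader(g);  /* work-items run as SIMD lanes inside */
    }
}

void dispatch(DispatchState *d)
{
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(worker, d);
    for (auto &t : pool)
        t.join();
}

This is only the shape of the proposal above; it deliberately ignores 
barriers, which is the part that actually needs the suspend/resume 
machinery.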

> 
> 
> > > I have a prototype rasterizer, however I haven't implemented
> > > binning for triangles yet or implemented interpolation.
> > > Currently, it can handle triangles in 3D homogeneous coordinates
> > > and calculate edge equations.
> > > https://github.com/programmerjake/tiled-renderer
> > > A previous 3D renderer that doesn't implement any vectorization
> > > and has OpenGL 1.x level functionality:
> > > https://github.com/programmerjake/lib3d/blob/master/softrender.cpp
> 
> > Well, I think we already have a completely fine rasterizer and
> > binning and whatever else in the llvmpipe code base. I'd much
> > rather any Mesa-based project didn't throw all of that away; there
> > is no reason the same swrast backend couldn't be abstracted to be
> > used for both GL and Vulkan, and introducing another one just
> > because it's interesting isn't a great fit for long-term project
> > maintenance.
> >
> > If there are improvements to llvmpipe that need to be made, then
> > that is something to possibly consider, but I'm not sure why a
> > swrast Vulkan needs a from-scratch rasterizer. For a project that
> > is so large in scope, I'd think reusing that code would be of some
> > use, since most of the fun stuff is all the texture sampling etc.
> 
> 
> I actually think implementing the rasterization algorithm is the best 
> part. I wanted the rasterization algorithm to be included in the 
> shaders, e.g. triangle setup and binning would be tacked on to the 
> end of the vertex shader, parameter interpolation and early Z tests 
> would be tacked on to the beginning of the fragment shader, and 
> blending on to the end. That way, LLVM could do more specialization 
> and instruction scheduling than is possible in llvmpipe now.

Parameter interpolation, early Z test, and blending *are* already 
tacked onto llvmpipe's fragment shaders.


I don't see how to effectively tack triangle setup onto the vertex 
shader: the vertex shader applies to vertices, whereas triangle setup 
and binning apply to primitives.  Usually, each vertex gets transformed 
only once with llvmpipe, no matter how many triangles refer to that 
vertex.  The only way to tack triangle setup onto vertex shading would 
be to process vertices one primitive at a time.  Of course, one could 
add an if-statement to skip reprocessing a vertex that was already 
processed, but then you have race conditions, and no benefit from 
inlining.
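
To make the vertex-reuse point concrete, here's a hypothetical sketch 
(none of these names are real llvmpipe entry points) of the two-pass 
structure an indexed draw wants, and why merging the second pass into 
the first is awkward:

#include <vector>

struct ShadedVertex { float pos[4]; };

/* Stand-in for the JIT-compiled vertex shader. */
static void run_vertex_shader(unsigned v, ShadedVertex *out)
{
    out->pos[0] = (float)v;
    out->pos[1] = out->pos[2] = 0.0f;
    out->pos[3] = 1.0f;
}

/* Stand-in: compute edge equations and append the triangle to the
 * bins of the tiles its bounding box touches. */
static void setup_and_bin_triangle(const ShadedVertex *,
                                   const ShadedVertex *,
                                   const ShadedVertex *)
{
}

void draw_indexed(const unsigned *idx, unsigned num_indices,
                  unsigned num_vertices)
{
    std::vector<ShadedVertex> cache(num_vertices);

    /* Pass 1: each vertex is transformed exactly once, no matter how
     * many triangles reference it. */
    for (unsigned v = 0; v < num_vertices; ++v)
        run_vertex_shader(v, &cache[v]);

    /* Pass 2: setup/binning is per primitive and needs all three
     * shaded vertices.  Folding this into pass 1 means either
     * re-shading shared vertices, or guarding the cache with
     * "already shaded" flags -- and racing on those across threads. */
    for (unsigned i = 0; i + 2 < num_indices; i += 3)
        setup_and_bin_triangle(&cache[idx[i]], &cache[idx[i + 1]],
                               &cache[idx[i + 2]]);
}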


And I'm afraid that tacking on rasterization too is one of those things 
that sounds great on paper but turns out quite bad in practice.  And I 
speak from experience: llvmpipe in fact had the last step of 
rasterization bolted onto the fragment shaders for some time, but we 
took it out because it was _slower_.

The issue is that if you bolt rasterization onto the shader body, you 
either:

- inline in the shader body the code for the maximum number of planes 
(which is 7: the 3 sides of the triangle plus the 4 sides of a scissor 
rect), and waste CPU cycles going through all of those tests, even 
though most of the time many of them aren't needed,

- or you generate if/for blocks for each plane, so you only do the 
needed tests, but then you have branch prediction issues...

Whereas if you keep rasterization _outside_ the shader, you can have 
specialized functions that do the rasterization based on the primitive 
itself (if the triangle is fully inside the scissor, you need 3 planes; 
if the stamp is fully inside the triangle, you need zero).  Essentially 
you can "compose" by coupling two function calls: you call a 
rasterization function that's specialized for the primitive, then a 
shading function that's specialized for the state (but doesn't depend 
on the primitive).

It makes sense: rasterization needs to be specialized for the 
primitive, not the graphics state; whereas the shader needs to be 
specialized for the state.
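
In pseudo-C++, the shape of that composition looks roughly like the 
sketch below (all names made up for illustration; this is not 
llvmpipe's actual code):

struct Tri {
    float edge[3][3];  /* 3 edge equations: a*x + b*y + c >= 0 inside */
};

/* The shading function is JIT-specialized for the graphics state
 * (blend mode, bound textures, ...), not for the primitive. */
typedef void (*ShadeFn)(int x, int y, unsigned coverage_mask);

/* Stamp (4x4 pixels) fully inside the triangle: zero plane tests. */
static void raster_full(const Tri *, int x, int y, ShadeFn shade)
{
    shade(x, y, 0xffffu);  /* all 16 pixels covered */
}

/* Triangle fully inside the scissor: only the 3 edge planes. */
static void raster_edges(const Tri *t, int x, int y, ShadeFn shade)
{
    unsigned mask = 0;
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i) {
            bool in = true;
            for (int e = 0; e < 3; ++e)
                in = in && t->edge[e][0] * (float)(x + i) +
                           t->edge[e][1] * (float)(y + j) +
                           t->edge[e][2] >= 0.0f;
            mask |= (unsigned)in << (j * 4 + i);
        }
    if (mask)
        shade(x, y, mask);
}

/* Per stamp: the primitive picks the rasterizer, the state picked the
 * shader, and neither had to be inlined into the other. */
static void rasterize_stamp(const Tri *t, bool stamp_inside_tri,
                            bool tri_inside_scissor,
                            int x, int y, ShadeFn shade)
{
    if (stamp_inside_tri)
        raster_full(t, x, y, shade);
    else if (tri_inside_scissor)
        raster_edges(t, x, y, shade);
    /* else: a variant that also tests the 4 scissor planes (7 total) */
}

The point is the indirection: each case only pays for the tests it 
actually needs, and the shading function stays oblivious to which 
rasterizer fed it.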



And this is just one of those non-intuitive things that isn't obvious 
until one actually does a lot of profiling and a lot of 
experimentation.  And trust me, a lot of time was spent fine-tuning 
this for llvmpipe (not by me -- most of the rasterization was done by 
Keith Whitwell).  By throwing llvmpipe out of the window and starting 
a new software renderer from scratch, you'd just be signing up to do 
it all over again.

Whereas if, instead of starting from scratch, you take llvmpipe and 
rewrite/replace one component at a time, you can reach exactly the 
same destination you want to reach, but you'll have something working 
every step of the way; when you take a bad step, you can measure the 
performance impact and readjust.  Plus, if you run out of time, you 
still have something useful -- not yet another half-finished project, 
which will quickly rot away.


Regarding generating scalar LLVM IR from SPIR-V and then doing 
whole-function vectorization: I don't think it's a bad idea per se.  
If I were writing llvmpipe from scratch today, I'd do something like 
that, especially because (scalar) LLVM IR is so pervasive in the 
graphics ecosystem anyway.

It was only after I had the TGSI -> LLVM IR translation all done that 
I stumbled upon http://compilers.cs.uni-saarland.de/projects/wfv/ .

I think the important thing here is that, once you've vectorized the 
shader, and converted your "texture_sample" intrinsics to 
"texture_sample.vector8", your "output_merger" intrinsics to 
"output_merger.vector8", and likewise your log2/exp2, you then slot in 
the fine-tuned llvmpipe code for texture sampling, blending, and math, 
as that's where your bottlenecks tend to be.  Because if you plan to 
write all the texture sampling from scratch, then you'll need a 
time/clone machine to complete this in a summer; and if you just use 
LLVM's / the standard C runtime's sqrt/log2/exp2/sin/cos, it will be 
dead slow.
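
To give a feel for the math side, below is a rough AVX2 sketch of an 
8-wide exp2 (hypothetical, and cruder than llvmpipe's: it uses a 
truncated Taylor polynomial and does no range clamping).  The point is 
that one such sequence services 8 lanes in a handful of instructions, 
whereas a scalarized loop over libm's exp2f pays a full call per lane:

#include <immintrin.h>

/* Rough 8-wide exp2 sketch (AVX2).  2^x = 2^floor(x) * 2^frac(x): the
 * integer part is built directly in the float exponent field, and the
 * fractional part uses a truncated Taylor polynomial.  No clamping,
 * so inputs far outside [-126, 127] misbehave; a production version
 * would clamp and use a better-fitted polynomial. */
static __m256 exp2_vec8(__m256 x)
{
    __m256 xf = _mm256_floor_ps(x);
    __m256 f  = _mm256_sub_ps(x, xf);        /* f in [0, 1) */
    __m256i e = _mm256_cvtps_epi32(xf);

    /* 2^e: bias the exponent and shift it into bits 23..30. */
    __m256i bits = _mm256_slli_epi32(
        _mm256_add_epi32(e, _mm256_set1_epi32(127)), 23);
    __m256 scale = _mm256_castsi256_ps(bits);

    /* 2^f ~= 1 + 0.693147 f + 0.240227 f^2 + 0.0555041 f^3 (Taylor). */
    __m256 p = _mm256_set1_ps(0.0555041f);
    p = _mm256_add_ps(_mm256_mul_ps(p, f), _mm256_set1_ps(0.240227f));
    p = _mm256_add_ps(_mm256_mul_ps(p, f), _mm256_set1_ps(0.693147f));
    p = _mm256_add_ps(_mm256_mul_ps(p, f), _mm256_set1_ps(1.0f));

    return _mm256_mul_ps(scale, p);
}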


Anyway, I hope this helps.  Best of luck.

Jose

