<div dir="auto">forgot to add mesa-dev when I sent (again).</div><div class="gmail_quote">---------- Forwarded message ----------<br>From: "Jacob Lifshay" <<a href="mailto:programmerjake@gmail.com">programmerjake@gmail.com</a>><br>Date: Feb 13, 2017 8:27 AM<br>Subject: Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc<br>To: "Nicolai Hähnle" <<a href="mailto:nhaehnle@gmail.com">nhaehnle@gmail.com</a>><br>Cc: <br><br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto"><div><br><div class="gmail_extra"><br><div class="gmail_quote">On Feb 13, 2017 7:54 AM, "Nicolai Hähnle" <<a href="mailto:nhaehnle@gmail.com" target="_blank">nhaehnle@gmail.com</a>> wrote:<br type="attribution"><blockquote class="m_-5562941407544249920quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="m_-5562941407544249920quoted-text">On 13.02.2017 03:17, Jacob Lifshay wrote:<br>
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="m_-5562941407544249920quoted-text">
On Feb 12, 2017 5:34 PM, "Dave Airlie" <<a href="mailto:airlied@gmail.com" target="_blank">airlied@gmail.com</a><br></div><div class="m_-5562941407544249920quoted-text">
<mailto:<a href="mailto:airlied@gmail.com" target="_blank">airlied@gmail.com</a>>> wrote:<br>
<br>
> I'm assuming that control barriers in Vulkan are identical to barriers<br>
> across a work-group in OpenCL. I was going to have a work-group be a<br>
> single OS thread, with the different work-items mapped to SIMD lanes.<br>
> If we need to have additional scheduling, I have written a JavaScript<br>
> compiler that supports generator functions, so I mostly know how to<br>
> write an LLVM pass to implement that. I was planning on writing the<br>
> shader compiler using LLVM, using the whole-function-vectorization pass<br>
> I will write, and using the pre-existing SPIR-V to LLVM translation<br>
> layer. I would also write some LLVM passes to translate texture reads<br>
> and similar operations into basic vector ops.<br>
<br>
Well, the problem is that the number of work-groups that get launched<br>
could be quite high, and this can cause a large overhead in the number<br>
of host threads that have to be launched. There was some discussion of<br>
this in the mesa-dev archives back when I added softpipe compute shaders.<br>
<br>
<br>
I would start a thread for each CPU, then have each thread run the<br>
compute shader a number of times, instead of having a thread per shader<br>
invocation.<br>
</div></blockquote>
<br>
This will not work.<br>
<br>
Please, read again what the barrier() instruction does: When the barrier() call is reached, _all_ threads within the workgroup are supposed to be run until they reach that barrier() call.</blockquote></div></div></div><div dir="auto"><br></div><div dir="auto">to clarify, I had meant that each os thread would run the sections of the shader between the barriers for all the shaders in a work group, then, when it finished the work group, it would go to the next work group assigned to the os thread.</div><div dir="auto"><br></div><div dir="auto">so, if our shader is:</div><div dir="auto">a = b + tid;</div><div dir="auto">barrier();</div><div dir="auto">d = e + f;</div><div dir="auto"><br></div><div dir="auto">and our simd width is 4, our work-group size is 128, and we have 16 os threads, then it will run for each os thread:</div><div dir="auto">for(workgroup = os_thread_index; workgroup < workgroup_count; <span style="font-family:sans-serif">workgroup++)</span></div><div dir="auto"><span style="font-family:sans-serif">{</span></div><div dir="auto"> for(tid_in_workgroup = 0; <span style="font-family:sans-serif">tid_in_workgroup</span> < 128; <span style="font-family:sans-serif">tid_in_workgroup</span> += 4)</div><div dir="auto"> {</div><div dir="auto"> ivec4 tid = ivec4(0, 1, 2, 3) + ivec4(tid_in_workgroup + workgroup * 128);</div><div dir="auto"> a[<span style="font-family:sans-serif">tid_in_workgroup</span> / 4] = ivec_add(b[<span style="font-family:sans-serif">tid_in_workgroup</span> / 4], tid<span style="font-family:sans-serif">);</span></div><div dir="auto"> }</div><div dir="auto"> memory_fence(); // if needed</div><div dir="auto"><div dir="auto" style="font-family:sans-serif"> for(tid_in_workgroup = 0; tid_in_workgroup < 128; tid_in_workgroup += 4)</div><div dir="auto" style="font-family:sans-serif"> {</div><div dir="auto" style="font-family:sans-serif"> d[tid_in_workgroup / 4] = vec_add(e[tid_in_workgroup / 4], f[tid_in_workgroup / 4]);</div><div dir="auto" 
style="font-family:sans-serif"> }</div>}</div><div dir="auto">// after this, we run the next rendering or compute job</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="m_-5562941407544249920quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="m_-5562941407544249920elided-text">
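To make the scheduling scheme above concrete, here is a minimal runnable C++ sketch of one OS thread executing a whole work-group: every invocation runs the section before barrier(), then every invocation runs the section after it. The SIMD lanes are written as a plain scalar loop for clarity, and all names (run_workgroup, workgroup_size) are illustrative, not from any real driver:

```cpp
#include <vector>

// Illustrative sketch: one OS thread runs all invocations of a work-group
// up to the barrier, then runs the code after the barrier. The barrier is
// trivially satisfied because every invocation of the first section has
// finished before any invocation of the second section starts.
constexpr int workgroup_size = 128;

void run_workgroup(std::vector<int> &a, const std::vector<int> &b,
                   std::vector<float> &d, const std::vector<float> &e,
                   const std::vector<float> &f, int workgroup) {
    int base = workgroup * workgroup_size;
    // section before barrier(): a = b + tid;
    for (int tid = 0; tid < workgroup_size; tid++)
        a[base + tid] = b[base + tid] + (base + tid);
    // barrier(): nothing to do in this scheduling scheme
    // section after barrier(): d = e + f;
    for (int tid = 0; tid < workgroup_size; tid++)
        d[base + tid] = e[base + tid] + f[base + tid];
}
```

An outer loop over `workgroup = os_thread_index; workgroup < workgroup_count; workgroup += os_thread_count` would then distribute work-groups across OS threads as described above.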
<br>
> I have a prototype rasterizer; however, I haven't implemented binning<br>
> for triangles yet or implemented interpolation. Currently, it can<br>
> handle triangles in 3D homogeneous coordinates and calculate edge<br>
> equations.<br>
> <a href="https://github.com/programmerjake/tiled-renderer" rel="noreferrer" target="_blank">https://github.com/programmerjake/tiled-renderer</a><br>
> A previous 3D renderer that doesn't implement any vectorization and<br>
> has OpenGL 1.x level functionality:<br>
> <a href="https://github.com/programmerjake/lib3d/blob/master/softrender.cpp" rel="noreferrer" target="_blank">https://github.com/programmerjake/lib3d/blob/master/softrender.cpp</a><br>
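For readers unfamiliar with homogeneous-coordinate rasterization, a short sketch of the edge-equation technique mentioned above may help. This is an illustration of the general method (cross products of homogeneous vertex positions give the edge line coefficients), not code taken from tiled-renderer:

```cpp
// Sketch: edge equations for a triangle given in 2D homogeneous
// coordinates (x, y, w). The cross product of two homogeneous vertices
// yields the coefficients of the line through them; evaluating that line
// at a point gives a signed value whose sign tells which side the point
// is on. Names here are illustrative.
struct Vec3 { double x, y, w; };

// cross product of two homogeneous vertices: the edge's line coefficients
Vec3 edge(const Vec3 &a, const Vec3 &b) {
    return {a.y * b.w - a.w * b.y,
            a.w * b.x - a.x * b.w,
            a.x * b.y - a.y * b.x};
}

// evaluate the edge equation at point (x, y, 1); for a counterclockwise
// triangle, interior points give a positive value for all three edges
double eval(const Vec3 &e, double x, double y) {
    return e.x * x + e.y * y + e.w;
}
```

A point is inside the triangle when all three edge evaluations share the triangle's winding sign, which is what makes the test SIMD-friendly: the three dot products can be computed for a whole vector of pixels at once.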
<br>
Well, I think we already have a completely fine rasterizer, binning, and<br>
whatever else in the llvmpipe code base. I'd much rather any Mesa-based<br>
project didn't throw all of that away; there is no reason the same<br>
swrast backend couldn't be abstracted to be used for both GL and Vulkan,<br>
and introducing another just because it's interesting isn't a great fit<br>
for long-term project maintenance.<br>
<br>
If there are improvements to llvmpipe that need to be made, then that is<br>
something to possibly consider, but I'm not sure why a swrast Vulkan<br>
needs a from-scratch rasterizer implemented. For a project that is so<br>
large in scope, I'd think reusing that code would be of some use, since<br>
most of the fun stuff is all the texture sampling etc.<br>
<br>
<br>
I actually think implementing the rasterization algorithm is the best<br>
part. I wanted the rasterization algorithm to be included in the<br>
shaders: e.g., triangle setup and binning would be tacked on to the end<br>
of the vertex shader, parameter interpolation and early depth tests<br>
would be tacked on to the beginning of the fragment shader, and blending<br>
on to the end. That way, LLVM could do more specialization and<br>
instruction scheduling than is possible in llvmpipe now.<br>
<br>
so the tile rendering function would essentially be:<br>
<br>
for(i = 0; i < triangle_count; i += vector_width)<br>
    jit_functions[i](tile_x, tile_y, &triangle_setup_results[i]);<br>
<br>
as opposed to the current llvmpipe code where there is a large amount of<br>
fixed code that isn't optimized with the shaders.<br>
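As a hedged illustration of what "optimized with the shaders" means here: once interpolation, the early depth test, the fragment shader body, and blending live in one specialized function, the compiler can schedule and eliminate work across what would otherwise be stage boundaries. Everything below (the Fragment struct, the trivial constant-output "shader") is hypothetical, just showing the shape of such a fused per-tile function:

```cpp
// Sketch of a fused fragment loop a JIT might emit for one pipeline
// state: interpolation, early-z, the shader body, and blending are all
// inlined into a single loop instead of separate fixed-function passes.
struct Fragment { float z, r; };

void fused_fragment_loop(Fragment *tile, int count, float z_in, float r_in) {
    for (int i = 0; i < count; i++) {
        float z = z_in;               // "interpolation" (constant here)
        if (z >= tile[i].z) continue; // early depth test, inlined
        tile[i].z = z;                // depth write
        float r = r_in;               // fragment shader body, inlined
        tile[i].r = r;                // "blend": replace
    }
}
```

In llvmpipe today only the shader body is JIT-compiled per state, so the compiler cannot, for example, hoist or sink work across the depth test the way it can in a fully fused loop like this.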
<br>
<br>
> The scope that I intended to complete is the bare minimum to be Vulkan<br>
> conformant (i.e. no tessellation and no geometry shaders): implementing<br>
> a loadable ICD for Linux and Windows that implements a single queue;<br>
> vertex, fragment, and compute shaders; events, semaphores, and fences;<br>
> images with the minimum requirements; an f32 depth buffer or an f24<br>
> depth buffer with 8-bit stencil; and a yet-to-be-determined compressed<br>
> format. For the image optimal layouts, I will probably use the same<br>
> chunked layout I use in<br>
> <a href="https://github.com/programmerjake/tiled-renderer/blob/master2/image.h#L59" rel="noreferrer" target="_blank">https://github.com/programmerjake/tiled-renderer/blob/master2/image.h#L59</a>,<br>
> where I have a linear array of chunks where each chunk has a linear<br>
> array of texels. If you think that's too big, we could leave out all of<br>
> the image formats except the two depth-stencil formats, the 8-bit and<br>
> 32-bit integer formats, and the 32-bit float formats.<br>
><br>
<br>
Seems like quite a large scope, possibly a bit big for a GSoC, though,<br>
especially one that intends not to use any existing Mesa code.<br>
<br>
<br>
Most of the Vulkan functions have a simple implementation when we don't<br>
need to worry about building stuff for a GPU or about synchronization<br>
(because we have only one queue), and LLVM implements most of the rest<br>
of the needed functionality. If we leave out most of the image formats,<br>
that would probably cut the amount of code by a third.<br>
<br>
<br>
Dave.<br>
<br>
<br>
<br>
<br></div>
______________________________<wbr>_________________<br>
mesa-dev mailing list<br>
<a href="mailto:mesa-dev@lists.freedesktop.org" target="_blank">mesa-dev@lists.freedesktop.org</a><br>
<a href="https://lists.freedesktop.org/mailman/listinfo/mesa-dev" rel="noreferrer" target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/mesa-dev</a><br>
<br>
</blockquote>
<br>
</blockquote></div><br></div></div></div>
</blockquote></div>