<div dir="auto">forgot to add mesa-dev when I sent (again).</div><div class="gmail_quote">---------- Forwarded message ----------<br>From: "Jacob Lifshay" <<a href="mailto:programmerjake@gmail.com">programmerjake@gmail.com</a>><br>Date: Feb 13, 2017 8:27 AM<br>Subject: Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc<br>To: "Nicolai Hähnle" <<a href="mailto:nhaehnle@gmail.com">nhaehnle@gmail.com</a>><br>Cc: <br><br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto"><div><br><div class="gmail_extra"><br><div class="gmail_quote">On Feb 13, 2017 7:54 AM, "Nicolai Hähnle" <<a href="mailto:nhaehnle@gmail.com" target="_blank">nhaehnle@gmail.com</a>> wrote:<br type="attribution"><blockquote class="m_-5562941407544249920quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="m_-5562941407544249920quoted-text">On 13.02.2017 03:17, Jacob Lifshay wrote:<br> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="m_-5562941407544249920quoted-text"> On Feb 12, 2017 5:34 PM, "Dave Airlie" <<a href="mailto:airlied@gmail.com" target="_blank">airlied@gmail.com</a><br></div><div class="m_-5562941407544249920quoted-text"> <mailto:<a href="mailto:airlied@gmail.com" target="_blank">airlied@gmail.com</a>>> wrote:<br> <br> > I'm assuming that control barriers in Vulkan are identical to barriers<br> > across a work-group in opencl. I was going to have a work-group be<br> a single<br> > OS thread, with the different work-items mapped to SIMD lanes. If<br> we need to<br> > have additional scheduling, I have written a javascript compiler that<br> > supports generator functions, so I mostly know how to write a llvm<br> pass to<br> > implement that. 
I was planning on writing the shader compiler<br> using llvm,<br> > using the whole-function-vectorization pass I will write, and<br> using the<br> > pre-existing spir-v to llvm translation layer. I would also write<br> some llvm<br> > passes to translate from texture reads and stuff to basic vector ops.<br> <br> Well the problem is number of work-groups that gets launched could be<br> quite high, and this can cause a large overhead in number of host<br> threads<br> that have to be launched. There was some discussion on this in mesa-dev<br> archives back when I added softpipe compute shaders.<br> <br> <br> I would start a thread for each cpu, then have each thread run the<br> compute shader a number of times instead of having a thread per shader<br> invocation.<br> </div></blockquote> <br> This will not work.<br> <br> Please, read again what the barrier() instruction does: When the barrier() call is reached, _all_ threads within the workgroup are supposed to be run until they reach that barrier() call.</blockquote></div></div></div><div dir="auto"><br></div><div dir="auto">to clarify, I had meant that each os thread would run the sections of the shader between the barriers for all the shader invocations in a work group, then, when it finished the work group, it would go to the next work group assigned to the os thread.</div><div dir="auto"><br></div><div dir="auto">so, if our shader is:</div><div dir="auto">a = b + tid;</div><div dir="auto">barrier();</div><div dir="auto">d = e + f;</div><div dir="auto"><br></div><div dir="auto">and our simd width is 4, our work-group size is 128, and we have 16 os threads, then it will run for each os thread:</div><div dir="auto">for(workgroup = os_thread_index; workgroup < workgroup_count; <span style="font-family:sans-serif">workgroup += 16)</span></div><div dir="auto"><span style="font-family:sans-serif">{</span></div><div dir="auto"> for(tid_in_workgroup = 0; <span style="font-family:sans-serif">tid_in_workgroup</span> < 128; <span 
style="font-family:sans-serif">tid_in_workgroup</span> += 4)</div><div dir="auto"> {</div><div dir="auto"> ivec4 tid = ivec4(0, 1, 2, 3) + ivec4(tid_in_workgroup + workgroup * 128);</div><div dir="auto"> a[<span style="font-family:sans-serif">tid_in_workgroup</span> / 4] = ivec_add(b[<span style="font-family:sans-serif">tid_in_workgroup</span> / 4], tid<span style="font-family:sans-serif">);</span></div><div dir="auto"> }</div><div dir="auto"> memory_fence(); // if needed</div><div dir="auto"><div dir="auto" style="font-family:sans-serif"> for(tid_in_workgroup = 0; tid_in_workgroup < 128; tid_in_workgroup += 4)</div><div dir="auto" style="font-family:sans-serif"> {</div><div dir="auto" style="font-family:sans-serif"> d[tid_in_workgroup / 4] = vec_add(e[tid_in_workgroup / 4], f[tid_in_workgroup / 4]);</div><div dir="auto" style="font-family:sans-serif"> }</div>}</div><div dir="auto">// after this, we run the next rendering or compute job</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="m_-5562941407544249920quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="m_-5562941407544249920elided-text"> <br> > I have a prototype rasterizer, however I haven't implemented<br> binning for<br> > triangles yet or implemented interpolation. 
currently, it can handle<br> > triangles in 3D homogeneous and calculate edge equations.<br> > <a href="https://github.com/programmerjake/tiled-renderer" rel="noreferrer" target="_blank">https://github.com/programmerj<wbr>ake/tiled-renderer</a><br> <<a href="https://github.com/programmerjake/tiled-renderer" rel="noreferrer" target="_blank">https://github.com/programmer<wbr>jake/tiled-renderer</a>><br> > A previous 3d renderer that doesn't implement any vectorization<br> and has<br> > opengl 1.x level functionality:<br> > <a href="https://github.com/programmerjake/lib3d/blob/master/softrender.cpp" rel="noreferrer" target="_blank">https://github.com/programmerj<wbr>ake/lib3d/blob/master/softrend<wbr>er.cpp</a><br> <<a href="https://github.com/programmerjake/lib3d/blob/master/softrender.cpp" rel="noreferrer" target="_blank">https://github.com/programmer<wbr>jake/lib3d/blob/master/softren<wbr>der.cpp</a>><br> <br> Well I think we already have a completely fine rasterizer and binning<br> and whatever<br> else in the llvmpipe code base. I'd much rather any Mesa based<br> project doesn't<br> throw all of that away, there is no reason the same swrast backend<br> couldn't<br> be abstracted to be used for both GL and Vulkan and introducing another<br> just because it's interesting isn't a great fit for long term project<br> maintenance..<br> <br> If there are improvements to llvmpipe that need to be made, then that<br> is something<br> to possibly consider, but I'm not sure why a swrast vulkan needs a<br> from scratch<br> raster implemented. For a project that is so large in scope, I'd think<br> reusing that code<br> would be of some use. Since most of the fun stuff is all the texture<br> sampling etc.<br> <br> <br> I actually think implementing the rasterization algorithm is the best<br> part. I wanted the rasterization algorithm to be included in the<br> shaders, eg. 
triangle setup and binning would be tacked on to the end of<br> the vertex shader and parameter interpolation and early z tests would be<br> tacked on to the beginning of the fragment shader and blending on to the<br> end. That way, llvm could do more specialization and instruction<br> scheduling than is possible in llvmpipe now.<br> <br> so the tile rendering function would essentially be:<br> <br> for(i = 0; i < triangle_count; i+= vector_width)<br> jit_functions[i](tile_x, tile_y, &triangle_setup_results[i]);<br> <br> as opposed to the current llvmpipe code where there is a large amount of<br> fixed code that isn't optimized with the shaders.<br> <br> <br> > The scope that I intended to complete is the bare minimum to be vulkan<br> > conformant (i.e. no tessellation and no geometry shaders), so<br> implementing a<br> > loadable ICD for linux and windows that implements a single queue,<br> vertex,<br> > fragment, and compute shaders, implementing events, semaphores,<br> and fences,<br> > implementing images with the minimum requirements, supporting a<br> f32 depth<br> > buffer or a f24 with 8bit stencil, and supporting a<br> yet-to-be-determined<br> > compressed format. For the image optimal layouts, I will probably<br> use the<br> > same chunked layout I use in<br> ><br> <a href="https://github.com/programmerjake/tiled-renderer/blob/master2/image.h#L59" rel="noreferrer" target="_blank">https://github.com/programmerj<wbr>ake/tiled-renderer/blob/master<wbr>2/image.h#L59</a><br> <<a href="https://github.com/programmerjake/tiled-renderer/blob/master2/image.h#L59" rel="noreferrer" target="_blank">https://github.com/programmer<wbr>jake/tiled-renderer/blob/maste<wbr>r2/image.h#L59</a>><br> ,<br> > where I have a linear array of chunks where each chunk has a<br> linear array of<br> > texels. 
If you think that's too big, we could leave out all of the<br> image<br> > formats except the two depth-stencil formats, the 8-bit and 32-bit<br> integer<br> > and 32-bit float formats.<br> ><br> <br> Seems like a quite large scope, possibly a bit big for a GSoC though,<br> esp one that<br> intends to not use any existing Mesa code.<br> <br> <br> most of the vulkan functions have a simple implementation when we don't<br> need to worry about building stuff for a gpu and synchronization<br> (because we have only one queue), and llvm implements most of the rest<br> of the needed functionality. If we leave out most of the image formats,<br> that would probably cut the amount of code by a third.<br> <br> <br> Dave.<br> <br> <br> <br> <br></div> ______________________________<wbr>_________________<br> mesa-dev mailing list<br> <a href="mailto:mesa-dev@lists.freedesktop.org" target="_blank">mesa-dev@lists.freedesktop.org</a><br> <a href="https://lists.freedesktop.org/mailman/listinfo/mesa-dev" rel="noreferrer" target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/mesa-dev</a><br> <br> </blockquote> <br> </blockquote></div><br></div></div></div> </blockquote></div>
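<div dir="auto">The scheduling described in the reply above (each OS thread runs the code between barriers for every invocation in a work group, then moves on to its next work group) can be sketched in plain C as follows. This is only an illustration under stated assumptions: the names WorkgroupData, run_workgroup and run_os_thread are hypothetical, the buffers are plain int arrays, the work groups are assumed to be handed out round-robin, and the inner lane loop merely stands in for one 4-wide SIMD operation.</div>

```c
#include <assert.h>

enum { SIMD_WIDTH = 4, WORKGROUP_SIZE = 128 };

/* Hypothetical per-work-group storage for the example shader:
 *     a = b + tid; barrier(); d = e + f;                      */
typedef struct {
    int a[WORKGROUP_SIZE], b[WORKGROUP_SIZE];
    int d[WORKGROUP_SIZE], e[WORKGROUP_SIZE], f[WORKGROUP_SIZE];
} WorkgroupData;

/* Run one whole work group on the calling thread. The shader is split
 * at barrier() into two phases, and phase 1 finishes for every
 * invocation before phase 2 starts, which is what barrier() requires. */
void run_workgroup(WorkgroupData *wg, int workgroup)
{
    /* Phase 1: the code before barrier(). */
    for (int base = 0; base < WORKGROUP_SIZE; base += SIMD_WIDTH) {
        for (int lane = 0; lane < SIMD_WIDTH; lane++) { /* one SIMD op */
            int local = base + lane;
            int tid = workgroup * WORKGROUP_SIZE + local;
            wg->a[local] = wg->b[local] + tid;
        }
    }

    /* barrier(): no thread synchronization is needed because a single
     * OS thread runs all invocations; a memory fence could go here. */

    /* Phase 2: the code after barrier(). */
    for (int base = 0; base < WORKGROUP_SIZE; base += SIMD_WIDTH) {
        for (int lane = 0; lane < SIMD_WIDTH; lane++) {
            int local = base + lane;
            wg->d[local] = wg->e[local] + wg->f[local];
        }
    }
}

/* Run every work group assigned to one OS thread: start at the thread's
 * index and stride by the thread count so no work group runs twice.
 * Returns the number of work groups this thread processed. */
int run_os_thread(WorkgroupData *groups, int workgroup_count,
                  int os_thread_index, int os_thread_count)
{
    assert(os_thread_count > 0);
    int done = 0;
    for (int wg = os_thread_index; wg < workgroup_count; wg += os_thread_count) {
        run_workgroup(&groups[wg], wg);
        done++;
    }
    return done;
}
```

<div dir="auto">In a real driver each call to run_os_thread would run on its own OS thread; calling it once per thread index from a single thread gives the same results, since the work groups in this sketch do not share data.</div>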