<br><br>On Tuesday, January 14, 2020, Jacob Lifshay <<a href="mailto:programmerjake@gmail.com">programmerjake@gmail.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Mon, Jan 13, 2020 at 9:39 AM Jason Ekstrand <<a href="mailto:jason@jlekstrand.net">jason@jlekstrand.net</a>> wrote:<br> ><br> > On Mon, Jan 13, 2020 at 11:27 AM Luke Kenneth Casson Leighton <<a href="mailto:lkcl@lkcl.net">lkcl@lkcl.net</a>> wrote:<br> >> jason i'd be interested to hear your thoughts on what jacob wrote, does it alleviate your concerns, (we're not designing hardware specifically around vec2/3/4, it simply has that capability).<br> ><br> ><br> > Not at all. If you just want a SW renderer that runs on RISC-V, feel free to write one. </blockquote><div><br></div><div>as we know, it would be embarrassingly low performance, not commercially viable, therefore, logically, we can rule that out as an option to pursue :)</div><div><br></div><div>i don't know if you're aware of Jeff Bush's work on Nyuzi? he set out to duplicate the work of the Intel Larrabee team (a software only GPU experiment), in an academic way (i.e. publishing everything, no matter how "bad")</div><div><br></div><div>Jeff sought an answer to the question as to, ahem, why the Larrabee team were not, ahem, "permitted" to publish GPU benchmarks for their work, despite it having high end supercomputer-grade Vector Processing capability.</div><div><br></div><div>i spent several months in discussion with him, i really enjoyed the conversations. 
we established that if you were to deploy a *standard* Vector Processor General Purpose ISA and engine (Nyuzi, Cray, MMX/SSE/AVX, RISCV RVV), with *zero* special custom hardware for 3D (so, no custom texturisation, no custom z buffers, no special tiled memory or associated pixel opcodes) the performance/watt that you would get would be a QUARTER of current commercial GPUs.</div><div><br></div><div>in other words you need either four times the silicon (four times the power consumption) just to be on par with current commercial GPUs, or you have to sell (only if completely delusional) something that's 25% the performance.</div><div><br></div><div>therefore, we have learned from that lesson, and will not be following that exact route either :)</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> If you want to vectorize in hardware and actually get serious performance out of it, I highly doubt his plan will work. That said, I wasn't planning to work on it so none of this is my problem so you're welcome to take or leave anything I say. 
:-)</blockquote><div><br></div><div>:)</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> So, since it may not have been clearly explained before, the GPU we're<br> building has masked vectorization like most other GPUs, it just that<br> it additionally supports the masked vectors' elements being 1 to 4<br> element subvectors.</blockquote><div><br></div><div>further: this is based on RVV (RISCV Vectors) which in turn is based on the Cray Vector system.</div><div><br></div><div>the plan is to *begin* from this base, and, following the strategy that's documented in Jeff Bush's 2016 paper, assess performance based on pixels/clock and also, again, following Jeff's work, keep a Seriously Close Eye on the power consumption.</div><div><br></div><div>(we've already added 128 registers, for example, because on GPU workloads, which are heavily LD-compute-ST on discontiguous memory areas, you absolutely cannot afford the power penalty of swapping out large numbers of registers through the L1/L2 cache barrier)</div><div><br></div><div>Jeff's strategies we will use as *iterative* guides to making improvements, just as he did. he actually went through seven different designs (maybe 8 if you include the ChiselGPU triangle raster engine he wrote)</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> If it turns out that using subvectors makes the GPU slower, we can add<br> a scalarization pass before the SIMT to vector translation, converting<br> everything to using more conventional operations.</blockquote><div><br></div><div>yes, exactly. 
and that would be one of the kinds of tasks for which the NLNet funding is available.</div><div><br></div><div>so that would be one very good example of something that would be assessed using Jeff Bush's methodology.</div><div><br></div><div>what's nice about this is: it's literally an opportunity for a Software Engineer working on MESA, instead of saying "damnit these hardware engineers really messed up, i feel totally powerless to fix it", to say "this isn't good enough! i need instruction X to get better performance!" and instead of saying "sorry we taped out already, deal with it, derwood" we go, "okay, great, give us 2 weeks and you can test out a new instruction X. start writing code to use it!"</div><div><br></div><div>i know that there is someone out there who, on reading this, is going to go "cool! and the actual hardware's libre too, and.. wait... i get money for this???"</div><div><br></div><div>:)</div><div><br></div><div>so again, jason, i'd like to emphasise just how grateful i am that you raised the issue of subvectors, because now we can put it on the list of things to watch out for and experiment with.</div><div><br></div><div>and, just to be clear: we've already had this iterative approach approved by NLNet: to start from a known-good (highly suboptimal but Vulkan Compliant) driver and to experiment with designs (hopefully not at the microarchitectural level) and instructions (a lot) and change the ISA (hopefully not a lot), to, over time, reach commercially acceptable performance.</div><div><br></div><div>and it's entirely libre. paid...and libre. who knew _that_ would ever happen in the GPU world?</div><div><br></div><div>l.</div><div><br></div><div><br></div><br><br>-- <br>---<br>crowd-funded eco-conscious hardware: <a href="https://www.crowdsupply.com/eoma68" target="_blank">https://www.crowdsupply.com/eoma68</a><br><br>