<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Mon, Apr 9, 2018 at 5:35 AM, Rob Clark <span dir="ltr"><<a href="mailto:robdclark@gmail.com" target="_blank">robdclark@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Mon, Apr 9, 2018 at 1:35 AM, Jason Ekstrand <<a href="mailto:jason@jlekstrand.net">jason@jlekstrand.net</a>> wrote:<br>
</span><div><div class="h5">> Rather lively discussion we've got going here...<br>
><br>
> On Sun, Apr 8, 2018 at 3:23 PM, Rob Clark <<a href="mailto:robdclark@gmail.com">robdclark@gmail.com</a>> wrote:<br>
>><br>
>> On Sun, Apr 8, 2018 at 5:54 PM, Bas Nieuwenhuizen<br>
>> <<a href="mailto:bas@basnieuwenhuizen.nl">bas@basnieuwenhuizen.nl</a>> wrote:<br>
>> > On Sun, Apr 8, 2018 at 11:40 PM, Rob Clark <<a href="mailto:robdclark@gmail.com">robdclark@gmail.com</a>> wrote:<br>
>> >> On Sun, Apr 8, 2018 at 5:20 PM, Bas Nieuwenhuizen<br>
>> >> <<a href="mailto:bas@basnieuwenhuizen.nl">bas@basnieuwenhuizen.nl</a>> wrote:<br>
>> >>>>>>>>>>> +<br>
>> >>>>>>>>>>> + /** The mode of the underlying variable */<br>
>> >>>>>>>>>>> + nir_variable_mode mode;<br>
>> >>>>>>>>>><br>
>> >>>>>>>>>> In fact, it seems like deref->mode is unused outside of<br>
>> >>>>>>>>>> nir_print and<br>
>> >>>>>>>>>> nir_validate.. for logical addressing we can get the mode from<br>
>> >>>>>>>>>> the<br>
>> >>>>>>>>>> deref_var->var at the start of the chain, and deref->mode has<br>
>> >>>>>>>>>> no<br>
>> >>>>>>>>>> meaning for physical addressing (where the mode comes from the<br>
>> >>>>>>>>>> pointer).<br>
>> >>>>>>>>>><br>
>> >>>>>>>>>> So maybe just drop deref->mode?<br>
>> >>>>>>>>><br>
>> >>>>>>>>> Isn't it still useful with logical addressing in case a var is<br>
>> >>>>>>>>> not<br>
>> >>>>>>>>> immediately available? (think VK_KHR_variable_pointers)<br>
><br>
><br>
> Yes, the idea here is to basically be the NIR equivalent of storage classes.<br>
> For cases where you can chase it all the way back to the variable, this is<br>
> pointless. For cases where you can't, this can be very useful.<br>
><br>
>><br>
>> >>>>>>>> not sure, maybe this should just also use fat-pointers like<br>
>> >>>>>>>> physical<br>
>> >>>>>>>> addressing does??<br>
><br>
><br>
> I *think* (without knowing the future with perfect accuracy) that I'd like<br>
> us to start moving away from fat pointers in SPIR-V -> NIR. They've been<br>
> working ok thus far, but we've already had complaints from the radv guys<br>
> that they don't want fat pointers for SLM, and I'm not sure they're that great<br>
> of a fit for other things. The big reason is that it's actually fairly<br>
> painful to change from one fat pointer scheme to another if whatever the<br>
> spirv_to_nir authors picked doesn't really fit your hardware.<br>
><br>
> I'm not 100% sure what to do but the idea I've got at the moment is<br>
> something like this:<br>
><br>
> 1) For every mode (or storage class), the driver provides to spirv_to_nir a<br>
> glsl_type which is the type it wants the pointer represented as.<br>
> 2) Add a deref_to_pointer intrinsic to convert the deref to that type and<br>
> use casts to convert the fat pointer to a deref again<br>
><br>
> Why the deref_to_pointer intrinsic? Because right now some annoying details<br>
> about nir_intrinsic_info force all derefs to have a single component.<br>
> If we didn't, then nir_intrinsic_store_var would have two variable-size<br>
> sources which aren't the same size. We could modify the rules and add a<br>
> concept of a "deref source" to nir_intrinsic_info and say that a deref<br>
> source can be any size. That would also work and may actually be a better<br>
> plan. I'm open to other suggestions as well.<br>
><br>
> The key point, however, is (1) where we just let the driver choose the<br>
> storage type of pointers and then they can have whatever lowering pass they<br>
> want. If that's a "standard" fat-pointers pass in nir_lower_io, so be it.<br>
> If they want to translate directly into LLVM derefs, they can do that too, I<br>
> suppose.<br>
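><br>
> To make (1) a bit more concrete, here's a rough sketch of what the<br>
> driver-provided pointer types could look like.  None of this is existing<br>
> API; the struct and field names are made up purely for illustration, and<br>
> the header path is approximate:<br>
><br>
> #include "nir_types.h"   /* glsl_vector_type() and friends */<br>
><br>
> /* Hypothetical per-storage-class pointer types a driver would hand to<br>
>  * spirv_to_nir.  The choices below are just examples of what different<br>
>  * drivers might pick.<br>
>  */<br>
> struct spirv_pointer_types {<br>
>    const struct glsl_type *ssbo_type;    /* e.g. uvec2: block index + offset */<br>
>    const struct glsl_type *shared_type;  /* e.g. a single 32-bit offset */<br>
>    const struct glsl_type *global_type;  /* e.g. a flat 64-bit address */<br>
> };<br>
><br>
> static void<br>
> fill_pointer_types(struct spirv_pointer_types *types)<br>
> {<br>
>    types->ssbo_type   = glsl_vector_type(GLSL_TYPE_UINT, 2);<br>
>    types->shared_type = glsl_uint_type();<br>
>    types->global_type = glsl_uint64_t_type();<br>
> }<br>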
><br>
>> >>>>>>>>> Also I could see this being useful in physical addressing too to<br>
>> >>>>>>>>> avoid<br>
>> >>>>>>>>> all passes working with derefs needing to do the constant<br>
>> >>>>>>>>> folding?<br>
><br>
><br>
> Constant folding is cheap, so I'm not too worried about that. The bigger<br>
> thing that actual derefs gain us is the ability of the compiler to reason<br>
> about derefs. Suppose, for instance, that you had the following:<br>
><br>
> struct S {<br>
> float a;<br>
> float b;<br>
> };<br>
><br>
> layout(set = 0, binding = 0) buffer Block {<br>
> S arr[10];<br>
> } foo;<br>
><br>
> void<br>
> my_func(int i)<br>
> {<br>
> foo.arr[i].a = make_a_value();<br>
> foo.arr[i].b = make_another_value();<br>
> use_value(foo.arr[i].a);<br>
> }<br>
><br>
> If everything is lowered to fat pointers, the compiler can't very easily<br>
> tell that the second line of my_func() didn't stomp foo.arr[i].a and so it<br>
> has to re-load the value in the third line. With derefs, we can easily see<br>
> that the second line didn't stomp it and just re-use the result<br>
> of the make_a_value() call. Yes, you may be able to tell the difference<br>
> between foo.arr[i].a and foo.arr[i].b with some fancy analysis and<br>
> arithmetic tracking but it'd be very painful to do. I've given this<br>
> substantial thought for almost two years now and haven't actually come up<br>
> with a good way to do it without the information provided by derefs.<br>
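><br>
> The kind of reasoning derefs buy you is basically a structural walk over<br>
> the two access chains.  A hand-wavy sketch (the little structs below are<br>
> made up, not real NIR API; it's just to show the shape of it):<br>
><br>
> /* Compare two access chains that hang off the same variable, step by<br>
>  * step, and see whether any step proves they can never touch the same<br>
>  * memory.<br>
>  */<br>
> enum step_kind { STEP_STRUCT_MEMBER, STEP_ARRAY_INDEX };<br>
><br>
> struct path_step {<br>
>    enum step_kind kind;<br>
>    unsigned member;               /* STEP_STRUCT_MEMBER: which field */<br>
>    int index_is_const;            /* STEP_ARRAY_INDEX: is the index known? */<br>
>    unsigned index;<br>
>    const struct path_step *next;<br>
> };<br>
><br>
> static int<br>
> chains_may_alias(const struct path_step *a, const struct path_step *b)<br>
> {<br>
>    /* Both chains start at the same variable, so the steps line up. */<br>
>    while (a && b) {<br>
>       /* Different members of the same struct never overlap, which is<br>
>        * exactly the foo.arr[i].a vs. foo.arr[i].b case above.<br>
>        */<br>
>       if (a->kind == STEP_STRUCT_MEMBER && a->member != b->member)<br>
>          return 0;<br>
><br>
>       /* Array steps only help when both indices are known constants;<br>
>        * two unknown indices might be equal, so assume they overlap.<br>
>        */<br>
>       if (a->kind == STEP_ARRAY_INDEX && a->index_is_const &&<br>
>           b->index_is_const && a->index != b->index)<br>
>          return 0;<br>
><br>
>       a = a->next;<br>
>       b = b->next;<br>
>    }<br>
>    return 1;<br>
> }<br>
><br>
> With fat pointers the same question turns into reasoning about byte-offset<br>
> arithmetic, which is exactly where it gets painful.<br>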
><br>
> The biggest thing that I think we will gain from deref instructions isn't<br>
> fancy syntax sugar and better nir_printability of pointers. It also isn't<br>
> the removal of explicit fat pointers in spirv_to_nir (though that is<br>
> also nice). The biggest thing I'm going for is improving NIR's ability to<br>
> optimize UBO, SSBO, and SLM access. We're pretty good today for local<br>
> variables (though there are a couple of known places for improvements) but<br>
> we're utterly rubbish for anything that leaves the current shader<br>
> invocation. If we want competent compute performance (which we do!) we need<br>
> to solve that problem.<br>
><br>
>><br>
>> >>>>>>>> The problem is that you don't necessarily know the type at<br>
>> >>>>>>>> compile<br>
>> >>>>>>>> time (and in the case where you do, you need to do constant<br>
>> >>>>>>>> folding to<br>
>> >>>>>>>> figure it out)<br>
>> >>>>>>><br>
>> >>>>>>> So I have two considerations here<br>
>> >>>>>>><br>
>> >>>>>>> 1) for vulkan you always know the mode, even when you don't know<br>
>> >>>>>>> the var.<br>
>> >>>>>>> 2) In CL the mode can still get annotated in the source program<br>
>> >>>>>>> (CL C<br>
>> >>>>>>> non-generic pointers) in cases in which we cannot reasonably<br>
>> >>>>>>> figure it<br>
>> >>>>>>> out with just constant folding. In those cases the mode is extra<br>
>> >>>>>>> information that you really lose.<br>
>> >>>>>><br>
>> >>>>>> so, even in cl 1.x, you could do things like 'somefxn(foo ?<br>
>> >>>>>> global_ptr<br>
>> >>>>>> : local_ptr)'.. depending on how much we inline all the things,<br>
>> >>>>>> that<br>
>> >>>>>> might not get CF'd away.<br>
>> >>><br>
>> >>> How does this even work btw? somefxn has a definition, and the<br>
>> >>> definition specifies a mode for the argument right? (which is<br>
>> >>> implicitly __private if the app does not specify anything?)<br>
>> >><br>
>> >> iirc, the cl spec has an example something along these lines..<br>
>> >><br>
>> >> it doesn't require *physical* storage for anything where you don't<br>
>> >> know what the ptr type is, however.. so fat ptrs in ssa space works<br>
>> >> out<br>
>> >><br>
>> >>>>><br>
>> >>>>> But something like<br>
>> >>>>> __constant int *ptr_value = ...;<br>
>> >>>>> store ptr in complex data structure.<br>
>> >>>>> __constant int* ptr2 = load from complex data structure.<br>
>> >>>>><br>
>> >>>>> Without explicitly annotating ptr2 it is unlikely that constant<br>
>> >>>>> folding would find that ptr2 is pointing to __constant address<br>
>> >>>>> space.<br>
>> >>>>> Hence removing the modes loses valuable information that you cannot<br>
>> >>>>> get back by constant folding. However, if you have a pointer with<br>
>> >>>>> unknown mode, we could have a special mode (or mode_all?) and you<br>
>> >>>>> can<br>
>> >>>>> use the uvec2 representation in that case?<br>
><br>
><br>
> I've thought about using a mode of 0 for this or maybe letting you OR all<br>
> the possible modes together since nir_variable_mode is, after all, a<br>
> bitfield. I don't have any huge preference at the moment.<br>
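><br>
> Something along these lines (just sketching the idea; the nir_var_*<br>
> values are the existing bitfield):<br>
><br>
> #include "nir.h"<br>
><br>
> /* When we can't pin the pointer down to a single mode, carry the union<br>
>  * of the candidate modes instead of inventing a separate "unknown" value.<br>
>  */<br>
> static nir_variable_mode<br>
> ssbo_or_shared_mode(void)<br>
> {<br>
>    return nir_var_shader_storage | nir_var_shared;<br>
> }<br>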
><br>
>><br>
>> >>>> hmm, I'm not really getting how deref->mode could magically have<br>
>> >>>> information that fatptr.y doesn't have.. if the mode is known, vtn<br>
>> >>>> could stash it in fatptr.y and everyone is happy? If vtn doesn't<br>
>> >>>> know<br>
>> >>>> this, then I don't see how deref->mode helps..<br>
>> >>><br>
>> >>> You mean insert it into the fatptr every time deref_cast is called?<br>
>> >>><br>
>> >>> Wouldn't that blow up the IR size significantly for very little<br>
>> >>> benefit?<br>
>> >><br>
>> >> in an easy to clean up way, so meh?<br>
>> ><br>
>> > We can't clean it up if we want to keep the information. Also nir is<br>
>> > pretty slow to compile already, so I'd like not to add a significant<br>
>> > number of instructions for very little benefit.<br>
><br>
><br>
> Really? I mean, I can believe it, but do you have any actual numbers to<br>
> back this up? It's considerably faster than some other IRs we have in mesa<br>
> though they are known to be pretty big pigs if we're honest. I'm very open<br>
> to any suggestions on how to improve compile times. If NIR is fundamentally<br>
> a pig, we should fix that.<br>
><br>
> I don't think compile time should be the deciding factor in whether or not<br>
> we use fat pointers as I doubt it will have a huge effect. However, how<br>
> much work it takes to get at information is a reasonable argument.<br>
><br>
>><br>
>> I guess I'm failing to see what information you'd lose.. or how there<br>
>> would be a problem..<br>
>><br>
>> I'm not strictly against having nir_var_unknown as a possible value<br>
>> for deref->mode, but I'm not really seeing how that helps anything,<br>
>> other than adding extra complexity to the IR..<br>
>><br>
>><br>
>> >>>>>><br>
>> >>>>>> I think I'm leaning towards using fat ptrs for the vk case, since I<br>
>> >>>>>> guess that is a case where you could always expect<br>
>> >>>>>> nir_src_as_const_value() to work, to get the variable mode. If for<br>
>> >>>>>> no<br>
>> other reason than I guess these derefs, if the var is not known,<br>
>> >>>>>> start w/ deref_cast, and it would be ugly for deref_cast to have to<br>
>> >>>>>> work differently for compute vs vk. But maybe Jason already has<br>
>> >>>>>> some<br>
>> >>>>>> thoughts about it?<br>
><br>
><br>
> I don't see why they would be different. Vulkan and CL are starting to<br>
> converge. In vulkan, you already can't always chase the pointer back to the<br>
> variable when VK_KHR_variable_pointers is used. I don't think that CL is as<br>
> special as you think it is. The only thing special about CL is the fact<br>
> that global and local are in the same address space. Beyond that, all the<br>
> problems we have to solve are fundamentally the same.<br>
><br>
>><br>
>> >>>>> I'd like to avoid fat pointers altogether on AMD since we would not<br>
>> >>>>> use it even for CL. a generic pointer is just a uint64_t for us,<br>
>> >>>>> with<br>
>> >>>>> no bitfield in there for the address space.<br>
><br>
><br>
> I tend to agree with this. (See above discussion).<br>
><br>
>><br>
>> >>>>> I think we may need to think a bit more about representation<br>
>> >>>>> however,<br>
>> >>>>> as e.g. for AMD a pointer is typically 64-bits (but we can do e.g.<br>
>> >>>>> 32-bits for known workgroup pointers), the current deref<br>
>> >>>>> instructions<br>
>> >>>>> return 32-bit, and you want something like a uvec2 as pointer<br>
>> >>>>> representation?<br>
>> >>>><br>
>> >>>> afaiu, newer AMD (and NV) hw can remap shared/private into a single<br>
>> >>>> global address space.. But I guess that is an easy subset of the<br>
>> >>>> harder case where drivers need to use different instructions.. so a<br>
>> >>>> pretty simple lowering pass run before lower_io could remap things<br>
>> >>>> that use fatptrs into something that ignores fatptr.y. Then opt<br>
>> >>>> passes make fatptr.y go away. So both AMD and hw that doesn't have a<br>
>> >>>> flat address space are happy.<br>
>> >>><br>
>> >>> But then you run into other issues, like how are you going to stuff a<br>
>> >>> 64-bit fatptr.x + a ?-bit fatptr.y into a 64-bit value for Physical64<br>
>> >>> addressing? Also this means we have to track the sources back to<br>
>> >>> the cast/var any time we want to do anything at all with any deref<br>
>> >>> which seems less efficient to me than just stuffing the deref in<br>
>> >>> there.<br>
>> >><br>
>> >> so fat ptrs only have to exist in ssa space, not be stored to<br>
>> >> something with a physically defined size..<br>
>> ><br>
>> > how does storing __generic pointers work then? those still need the<br>
>> > fat bit for your hw right?<br>
>><br>
>> for hw that can't map everything into a single flat address space, you<br>
>> *can't* store a fat pointer.. oh well, it doesn't mean that that hw<br>
>> can't implement lesser ocl versions (since this is defn not required<br>
>> for ocl 1.x and I don't *think* it is required for 2.x, but I haven't<br>
>> checked the spec.. I guess if intel supports 2.x then it isn't<br>
>> required..)<br>
><br>
><br>
> You can still carve up the address space and put local memory at some<br>
> absurdly high address and do<br>
><br>
> if (ptr > 0xffffffffffff0000)<br>
> store_local(ptr & 0xffff)<br>
> else<br>
> store_global(ptr)<br>
><br>
<br>
</div></div>That might be an option if we need to physically store a fat pointer<br>
(but I'm not convinced we need to). But as a general solution it is a<br>
horrible idea. You either lose information (since when was TMI a<br>
problem for compilers?) or you get to teach constant folding that even<br>
though it doesn't know the value of 'a' it does know the value of 'a &<br>
0xffffffffffff0000', which is insanity.<br>
<br>
And simply not optimizing away the global/local/private split turns every<br>
load/store into something that looks like:<br>
<br>
<a href="http://cmps.f.lt" rel="noreferrer" target="_blank">cmps.f.lt</a> p0.x, c4.x, r0.z<br>
; delay 6 instructions so branch can see the value in p0.x<br>
(rpt5)nop<br>
br !p0.x, #3<br>
ldg.u32 r0.w, r0.y, 1<br>
jump #11<br>
(jp)<a href="http://cmps.f.lt" rel="noreferrer" target="_blank">cmps.f.lt</a> p0.x, c4.y, r0.z<br>
(rpt5)nop ; delay 6 instruction slots<br>
br !p0.x, #8<br>
ldl.u32 r0.w, r0.y, 1<br>
jump #3<br>
; private is really just global with a driver provided buffer<br>
; and compiler computed offset<br>
(jp)add.s r0.y, c10.x ; add base address of buffer<br>
(rpt3)nop ; alu results not immediately avail and no<br>
          ; other instr to fill the empty slots<br>
add.s r0.y, ... offset calculated from local_id<br>
(rpt5)nop ; 6 slots for result from alu avail to mem<br>
          ; instructions<br>
ldg.u32 r0.w, r0.y, 1<br>
(jp)(sy)... and now we have a result..<br>
<br>
<br>
which is simply not an option! And the result is actually worse since<br>
we end up with 64b pointer math lowered to 32b instructions!<br>
<br>
tbh, I *really* don't see the problem with fat pointers, but if<br>
someone else doesn't want deref_cast to use fat pointers, then I must<br>
insist on a different intrinsic which does, and a compiler option for<br>
vtn to choose which to emit.<br><div class="HOEnZb"><div class="h5"></div></div></blockquote><div><br></div><div>I'm sorry. That comment was NOT intended as a suggestion for a universal alternative to fat pointers. It was merely to demonstrate how an architecture like Intel's, which uses different messages for local vs. global, could implement physical 64-bit (uint64_t) generic pointers. Someone (looks like Rob from the indentation) made a comment about difficulties implementing OpenCL 2.x and I was just trying to show that it could be done. Of course, stuffing the extra information in a vector component is easier.<br><br>This was a side comment to what I thought was a side comment in the discussion.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">
>><br>
>> >><br>
>> >> As far as tracking things to the head of the chain of deref<br>
>> >> instructions, that is a matter of a simple helper or two. Not like<br>
>> >> the chain of deref's is going to be 1000's of instructions..<br>
><br>
><br>
> Yes and no (mostly no) once you start throwing in phis, selects, and loading<br>
> pointers from variables. The first two you can explain away by just saying,<br>
> "follow the phi/select sources" but the moment you load a pointer out of a<br>
> variable, you have to do something. Sure, you could just drop .y and stomp<br>
> it to the known storage class, but that hardly seems ideal.<br>
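><br>
> Roughly, any "chase the mode back" helper ends up shaped like this<br>
> (pseudo-C with made-up types, not real NIR API; it's only meant to show<br>
> where the information runs out):<br>
><br>
> enum ptr_class { CLASS_GLOBAL, CLASS_SHARED, CLASS_UNKNOWN };<br>
><br>
> enum def_kind { DEF_FROM_VARIABLE, DEF_FROM_PHI_OR_SELECT, DEF_FROM_LOAD };<br>
><br>
> struct ptr_def {<br>
>    enum def_kind kind;<br>
>    enum ptr_class var_class;      /* DEF_FROM_VARIABLE */<br>
>    unsigned num_srcs;             /* DEF_FROM_PHI_OR_SELECT (capped for the sketch) */<br>
>    const struct ptr_def *srcs[4];<br>
> };<br>
><br>
> static enum ptr_class<br>
> chase_class(const struct ptr_def *def)<br>
> {<br>
>    switch (def->kind) {<br>
>    case DEF_FROM_VARIABLE:<br>
>       return def->var_class;<br>
>    case DEF_FROM_PHI_OR_SELECT: {<br>
>       /* Phis and selects just recurse into their sources... */<br>
>       enum ptr_class c = chase_class(def->srcs[0]);<br>
>       for (unsigned i = 1; i != def->num_srcs; i++) {<br>
>          if (chase_class(def->srcs[i]) != c)<br>
>             return CLASS_UNKNOWN;<br>
>       }<br>
>       return c;<br>
>    }<br>
>    case DEF_FROM_LOAD:<br>
>    default:<br>
>       /* ...but a load from memory is where the mode is lost unless it<br>
>        * lives on the instruction or in the pointer bits themselves.<br>
>        */<br>
>       return CLASS_UNKNOWN;<br>
>    }<br>
> }<br>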
><br>
>><br>
>> >>> Also, what would the something which ignores fatptr.y be? I'd assume<br>
>> >>> that would be the normal deref based stuff, but requiring fatptr<br>
>> >>> contradicts that?<br>
>> >><br>
>> >> if you have a flat address space, maybe a pass (or option for<br>
>> >> lower_io) to just convert everything to load/store_global (since<br>
>> >> essentially what these GPUs are doing is remapping shared/private into<br>
>> >> the global address space)<br>
>> ><br>
>> > but I'd like to only do that if we really don't know the mode<br>
>> > statically since it is somewhat slower/less flexible. (also radv is<br>
>> > unlikely to use nir_lower_io for a lot of this stuff since we can<br>
>> > lower derefs into LLVM GEPs directly)<br>
>><br>
>> I mean, we control all the src code, we can add more options to lower_io.<br>
>><br>
>> ><br>
>> > Hence if we want the cases where we know the mode statically we need<br>
>> > to not lower the fatptr, but then we have the whole fatptr mess.<br>
>> ><br>
>><br>
>> well, fatptr mess is unavoidable.. I mostly just fail to see how a<br>
>> more general solution (ie. storing the variable mode in ssa) is worse<br>
>> than having to deal with both cases (in ssa or in instr). The in ssa<br>
>> case is something that is easy to recover since anything that does<br>
>> pointer math has to split out the .y component and fuse it back in<br>
>> after the pointer math. (And if you have a flat address space, I<br>
>> guess I'm failing to see why you care about the mode in the first<br>
>> place.)<br>
><br>
><br>
> Yes, fat pointers are inevitable for some hardware at some stage in the<br>
> compile. That does not, however, mean that we want fat pointers straight<br>
> out of SPIR-V.<br>
</div></div></blockquote></div><br></div></div>