<html> <head> <base href="https://bugs.freedesktop.org/" /> </head> <body> <div> <a class="bz_bug_link bz_status_NEW " title="NEW - Implement SSBOs in GLSL front-end and i965" href="https://bugs.freedesktop.org/show_bug.cgi?id=89597#c28">Comment # 28</a> on <a class="bz_bug_link bz_status_NEW " title="NEW - Implement SSBOs in GLSL front-end and i965" href="https://bugs.freedesktop.org/show_bug.cgi?id=89597">bug 89597</a> from <a class="email" href="mailto:itoral@igalia.com" title="Iago Toral <itoral@igalia.com>"> Iago Toral</a> <pre>(In reply to Kristian Høgsberg from <a href="show_bug.cgi?id=89597#c26">comment #26</a>)
> (In reply to Iago Toral from <a href="show_bug.cgi?id=89597#c25">comment #25</a>)
> > I think the FS bits work well now and I have switched focus to vec4. Here
> > the problem is again with non-constant offsets that are not 16-byte aligned
> > (this is actually not working in master for UBOs either). I have a working
> > solution for this, but I'd like to discuss here the details and see what you
> > think about it:
> >
> > Instead of using scattered messages like we do for writes in the FS I am
> > experimenting with unaligned oword block reads in this case. The solution,
> > that I describe below, seems to work well but I had to work around a couple
> > of issues that maybe can be dealt with in a better way. The solution looks
> > like this (only for non-constant offsets, for constant offsets we can just
> > use dual oword read):
> >
> > 1) Since unaligned oword messages are not "dual", we have to handle each of
> > the two SIMD4x2 vertices separately, so I emit two separate unaligned reads,
> > one from offset.0 (first vertex) and another from offset.4 (second vertex).
> > I store the results of these reads to separate virtual registers
> > read_result0 and read_result1 respectively.
> >
> > 2) In the next step I merge both read results into a single register
> > suitable for SIMD4x2 operation. That is, if we call dst the destination of
> > the SIMD4x2 operation, I move the lower half of read_result0 to the lower
> > half of dst and the lower half of read_result1 to the higher half of dst.
> > For this part I have defined a generator opcode (let's call it
> > simd4x2_merge) that I initially implemented as two mov(4) operations, like
> > this:
> >
> > brw_MOV(p,
> >         brw_vec4_reg(dst.file, dst.nr, 0),
> >         brw_vec4_reg(src0.file, src0.nr, 0));
> > brw_MOV(p,
> >         brw_vec4_reg(dst.file, dst.nr, 4),
> >         brw_vec4_reg(src1.file, src1.nr, 0));
> >
> > The first problem I found with this solution is that in some examples it
> > would trigger the following assertion:
> >
> > brw_vec4_generator.cpp:1927: void brw::vec4_generator::generate_code(const
> > cfg_t*): Assertion `p->nr_insn == pre_emit_nr_insn + 1 || !"conditional_mod,
> > no_dd_check, or no_dd_clear set for IR " "emitting more than 1 instruction"'
> > failed.
> >
> > This problem comes from the fact that in these cases,
> > opt_set_dependency_control would set no_dd_clear/no_dd_check on the
> > simd4x2_merge instruction and this does not seem to like the fact that this
> > generator opcode actually expands to more than just one assembly
> > instruction. I did not see an obvious way to deal with this other than
> > skipping opt_set_dependency_control for this generator opcode, since the
> > fact that it spawns more than just one assembly instruction will inevitably
> > lead to this problem. I imagine that it could be possible to fix this more
> > generally by having the generator set dependency control flags on all the
> > instructions emitted by the opcode maybe. Anyway, since there is a
> > is_dep_ctrl_unsafe() function that seems to be there to select situations
> > where we want to avoid opt_set_dependency_control to kick in, I just added
> > this opcode there. Is there anything else to this scenario that I am missing?
> >
> > With that fixed, I found another issue with the register coalesce
> > optimization pass, as it attempted to rewrite the simd4x2_merge instruction
> > to write directly to an MRF register using a writemask. As I show above, my
> > first approach to the generator opcode was to have two MOVs that would
> > always write all of the dst register (since that made sense in my context),
> > but if other optimization passes can take the liberty to rewrite the
> > instruction to introduce a writemask then that approach is no longer valid
> > and I have to honor the writemask on the dst. The (small?) problem with that
> > is that as far as I know, I cannot do mov(4) operations that honor
> > dst.dw1.bits.writemask, so I have to do that manually emitting mov(1)
> > operations for each channel enabled. Is there a better way to handle this?
>
> Would it be better to use the scattered read in vec4 mode as well to load
> both vec4s in one operation? Either way, there is no unaligned write
> message, so we'd have to use scatter write for writing.

I already had some initial work started with unaligned reads for vec4, so that
is why I continued with that approach. For writes, you are right that we will
have to use scattered messages in any case.</pre> </div> <hr> You are receiving this mail because: <ul> <li>You are the QA Contact for the bug.</li> </ul> </body> </html>