<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Nov 29, 2017 at 6:50 PM, Jose Maria Casanova Crespo <span dir="ltr"><<a href="mailto:jmcasanova@igalia.com" target="_blank">jmcasanova@igalia.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">load_ubo is using 32-bit loads as uniforms surfaces have a 32-bit<br> surface format defined. So when reading 16-bit components with the<br> sampler we need to unshuffle two 16-bit components from each 32-bit<br> component.<br> <br> Using the sampler avoids the use of the byte_scattered_read message<br> that needs one message for each component and is supposed to be<br> slower.<br> <br> In the case of SKL+ we take advantage of a hardware feature that<br> automatically defines a channel mask based on the rlen value, so on<br> SKL+ we only use half of the registers without using a header in the<br> payload.<br> ---<br> src/intel/compiler/brw_fs.<wbr>cpp | 31 +++++++++++++++++++++++++++---<wbr>-<br> src/intel/compiler/brw_fs_gen<wbr>erator.cpp | 10 ++++++++--<br> 2 files changed, 35 insertions(+), 6 deletions(-)<br> <br> diff --git a/src/intel/compiler/brw_fs.cp<wbr>p b/src/intel/compiler/brw_fs.cp<wbr>p<br> index 1ca4d416b2..9c543496ba 100644<br> --- a/src/intel/compiler/brw_fs.cp<wbr>p<br> +++ b/src/intel/compiler/brw_fs.cp<wbr>p<br> @@ -184,9 +184,17 @@ fs_visitor::VARYING_PULL_CONST<wbr>ANT_LOAD(const fs_builder &bld,<br> * a double this means we are only loading 2 elements worth of data.<br> * We also want to use a 32-bit data type for the dst of the load operation<br> * so other parts of the driver don't get confused about the size of the<br> - * result.<br> + * result. On the case of 16-bit data we only need half of the 32-bit<br> + * components on SKL+ as we take advance of using message return size to<br> + * define an xy channel mask.<br> */<br> - fs_reg vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);<br> + fs_reg vec4_result;<br> + if (type_sz(dst.type) == 2 && (devinfo->gen >= 9)) {<br> + vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 2);<br> + vec4_result = retype(vec4_result, BRW_REGISTER_TYPE_HF);<br> + } else {<br> + vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);<br> + } <br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> fs_inst *inst = bld.emit(FS_OPCODE_VARYING_PUL<wbr>L_CONSTANT_LOAD_LOGICAL,<br> vec4_result, surf_index, vec4_offset);<br> inst->size_written = 4 * vec4_result.component_size(ins<wbr>t->exec_size);<br> @@ -197,8 +205,23 @@ fs_visitor::VARYING_PULL_CONST<wbr>ANT_LOAD(const fs_builder &bld,<br> }<br> <br> vec4_result.type = dst.type;<br> - bld.MOV(dst, offset(vec4_result, bld,<br> - (const_offset & 0xf) / type_sz(vec4_result.type)));<br> +<br> + if (type_sz(dst.type) == 2) {<br> + /* 16-bit types need to be unshuffled as each pair of 16-bit components<br> + * is packed on a 32-bit compoment because we are using a 32-bit format<br> + * in the surface of uniform that is read by the sampler.<br> + * TODO: On BDW+ mark when an uniform has 16-bit type so we could setup a<br> + * surface format of 16-bit and use the 16-bit return format at the<br> + * sampler.<br> + */<br> + vec4_result.stride = 2;<br> + bld.MOV(dst, byte_offset(offset(vec4_result<wbr>, bld,<br> + (const_offset & 0x7) / 4),<br> + (const_offset & 0x7) / 2 % 2 * 2));<br> + } else {<br> + bld.MOV(dst, offset(vec4_result, bld,<br> + (const_offset & 0xf) / type_sz(vec4_result.type)));<br> + }<br></blockquote><div><br></div><div>This seems overly complicated. How about something like</div><div><br></div><div>fs_reg dw = offset(vec4_result, bld, (const_offset & 0xf) / 4);<br></div><div>switch (type_sz(dst.type)) {</div><div>case 2:</div><div> shuffle_32bit_load_result_to_16bit_data(bld, dst, dw, 1);</div><div> bld.MOV(dst, subscript(dw, dst.type, (const_offset / 2) & 1));</div><div> break;<br></div><div>case 4:</div><div> bld.MOV(dst, dw);<br></div><div> break;<br></div><div>case 8:</div><div> shuffle_32bit_load_result_to_64bit_data(bld, dst, dw, 1);</div><div> break;</div><div>default:</div><div> unreachable();<br></div><div>}<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> }<br> <br> /**<br> diff --git a/src/intel/compiler/brw_fs_ge<wbr>nerator.cpp b/src/intel/compiler/brw_fs_ge<wbr>nerator.cpp<br> index a3861cd68e..00a4e29147 100644<br> --- a/src/intel/compiler/brw_fs_ge<wbr>nerator.cpp<br> +++ b/src/intel/compiler/brw_fs_ge<wbr>nerator.cpp<br> @@ -1381,12 +1381,18 @@ fs_generator::generate_varying<wbr>_pull_constant_load_gen7(fs_<wbr>inst *inst,<br> uint32_t simd_mode, rlen, mlen;<br> if (inst->exec_size == 16) {<br> mlen = 2;<br> - rlen = 8;<br> + if (type_sz(dst.type) == 2 && (devinfo->gen >= 9))<br> + rlen = 4;<br> + else<br> + rlen = 8;<br></blockquote><div><br></div><div>I'm not sure what I think of this. We intentionally use a vec4 today instead of a single float because it lets us potentially batch up a bunch of loads in a single texture operation. Are we sure that we never need more than 2 of those result registers?</div><div><br></div><div>--Jason<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> simd_mode = BRW_SAMPLER_SIMD_MODE_SIMD16;<br> } else {<br> assert(inst->exec_size == 8);<br> mlen = 1;<br> - rlen = 4;<br> + if (type_sz(dst.type) == 2 && (devinfo->gen >= 9))<br> + rlen = 2;<br> + else<br> + rlen = 4;<br> simd_mode = BRW_SAMPLER_SIMD_MODE_SIMD8;<br> }<br> <span class="gmail-m_265234056147124101HOEnZb"><font color="#888888"><br> --<br> 2.14.3<br> <br> ______________________________<wbr>_________________<br> mesa-dev mailing list<br> <a href="mailto:mesa-dev@lists.freedesktop.org" target="_blank">mesa-dev@lists.freedesktop.org</a><br> <a href="https://lists.freedesktop.org/mailman/listinfo/mesa-dev" rel="noreferrer" target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/mesa-dev</a><br> </font></span></blockquote></div><br></div></div>