<div dir="ltr">Thank you for the review Ilia!<br><br><div class="gmail_quote"><div dir="ltr">On Fri, 1 Jun 2018 at 23:44, Ilia Mirkin <<a href="mailto:imirkin@alum.mit.edu">imirkin@alum.mit.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, Jun 1, 2018 at 6:21 PM, Plamena Manolova<br>
<<a href="mailto:plamena.n.manolova@gmail.com" target="_blank">plamena.n.manolova@gmail.com</a>> wrote:<br>
> This patch adds the implentation of ARB_compute_variable_group_size<br>
> for i965. We do this by storing the group size in a buffer surface,<br>
> similarly to the work group number.<br>
><br>
> Signed-off-by: Plamena Manolova <<a href="mailto:plamena.n.manolova@gmail.com" target="_blank">plamena.n.manolova@gmail.com</a>><br>
> ---<br>
>  docs/features.txt                                |  2 +-<br>
>  docs/relnotes/18.2.0.html                        |  1 +<br>
>  src/compiler/nir/nir_lower_system_values.c       | 14 ++++<br>
>  src/intel/compiler/brw_compiler.h                |  2 +<br>
>  src/intel/compiler/brw_fs.cpp                    | 45 ++++++++----<br>
>  src/intel/compiler/brw_fs_nir.cpp                | 20 ++++++<br>
>  src/intel/compiler/brw_nir_lower_cs_intrinsics.c | 87 +++++++++++++++++-------<br>
>  src/mesa/drivers/dri/i965/brw_compute.c          | 25 ++++++-<br>
>  src/mesa/drivers/dri/i965/brw_context.h          |  1 +<br>
>  src/mesa/drivers/dri/i965/brw_cs.c               |  4 ++<br>
>  src/mesa/drivers/dri/i965/brw_wm_surface_state.c | 27 +++++++-<br>
>  src/mesa/drivers/dri/i965/intel_extensions.c     |  1 +<br>
>  12 files changed, 187 insertions(+), 42 deletions(-)<br>
><br>
> diff --git a/docs/features.txt b/docs/features.txt<br>
> index ed4050cf98..7c3c856d73 100644<br>
> --- a/docs/features.txt<br>
> +++ b/docs/features.txt<br>
> @@ -298,7 +298,7 @@ Khronos, ARB, and OES extensions that are not part of any OpenGL or OpenGL ES ve<br>
><br>
>    GL_ARB_bindless_texture                               DONE (nvc0, radeonsi)<br>
>    GL_ARB_cl_event                                       not started<br>
> -  GL_ARB_compute_variable_group_size                    DONE (nvc0, radeonsi)<br>
> +  GL_ARB_compute_variable_group_size                    DONE (nvc0, radeonsi, i965)<br>
>    GL_ARB_ES3_2_compatibility                            DONE (i965/gen8+)<br>
>    GL_ARB_fragment_shader_interlock                      DONE (i965)<br>
>    GL_ARB_gpu_shader_int64                               DONE (i965/gen8+, nvc0, radeonsi, softpipe, llvmpipe)<br>
> diff --git a/docs/relnotes/18.2.0.html b/docs/relnotes/18.2.0.html<br>
> index a3f44a29dc..4ceeb7471f 100644<br>
> --- a/docs/relnotes/18.2.0.html<br>
> +++ b/docs/relnotes/18.2.0.html<br>
> @@ -45,6 +45,7 @@ Note: some of the new features are only available with certain drivers.<br>
><br>
>  <ul><br>
>  <li>GL_ARB_fragment_shader_interlock on i965</li><br>
> +<li>GL_ARB_compute_variable_group_size on i965</li><br>
>  </ul><br>
><br>
>  <h2>Bug fixes</h2><br>
> diff --git a/src/compiler/nir/nir_lower_system_values.c b/src/compiler/nir/nir_lower_system_values.c<br>
> index 487da04262..0af6d69426 100644<br>
> --- a/src/compiler/nir/nir_lower_system_values.c<br>
> +++ b/src/compiler/nir/nir_lower_system_values.c<br>
> @@ -57,6 +57,15 @@ convert_block(nir_block *block, nir_builder *b)<br>
>            *    gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID"<br>
>            */<br>
><br>
> +<br>
> +          /*<br>
> +           * If the local work group size is variable we can't lower the global<br>
> +           * invocation id here.<br>
> +           */<br>
> +          if (b->shader->info.cs.local_size_variable) {<br>
> +             break;<br>
> +          }<br>
> +<br>
<br>
There appears to be some tabs vs spaces thing here.<br>
<br>
>           nir_const_value local_size;<br>
>           memset(&local_size, 0, sizeof(local_size));<br>
>           local_size.u32[0] = b->shader->info.cs.local_size[0];<br>
> @@ -102,6 +111,11 @@ convert_block(nir_block *block, nir_builder *b)<br>
>        }<br>
><br>
>        case SYSTEM_VALUE_LOCAL_GROUP_SIZE: {<br>
> +         /* If the local work group size is variable we can't lower it here */<br>
> +         if (b->shader->info.cs.local_size_variable) {<br>
> +            break;<br>
> +         }<br>
> +<br>
>           nir_const_value local_size;<br>
>           memset(&local_size, 0, sizeof(local_size));<br>
>           local_size.u32[0] = b->shader->info.cs.local_size[0];<br>
> diff --git a/src/intel/compiler/brw_compiler.h b/src/intel/compiler/brw_compiler.h<br>
> index 8b4e6fe2e2..f54952c28f 100644<br>
> --- a/src/intel/compiler/brw_compiler.h<br>
> +++ b/src/intel/compiler/brw_compiler.h<br>
> @@ -759,6 +759,7 @@ struct brw_cs_prog_data {<br>
>     unsigned threads;<br>
>     bool uses_barrier;<br>
>     bool uses_num_work_groups;<br>
> +   bool uses_variable_group_size;<br>
><br>
>     struct {<br>
>        struct brw_push_const_block cross_thread;<br>
> @@ -771,6 +772,7 @@ struct brw_cs_prog_data {<br>
>         * surface indices the CS-specific surfaces<br>
>         */<br>
>        uint32_t work_groups_start;<br>
> +      uint32_t work_group_size_start;<br>
>        /** @} */<br>
>     } binding_table;<br>
>  };<br>
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp<br>
> index d67c0a4192..28730af47b 100644<br>
> --- a/src/intel/compiler/brw_fs.cpp<br>
> +++ b/src/intel/compiler/brw_fs.cpp<br>
> @@ -7228,18 +7228,32 @@ brw_compile_cs(const struct brw_compiler *compiler, void *log_data,<br>
>                 int shader_time_index,<br>
>                 char **error_str)<br>
>  {<br>
> -   prog_data->local_size[0] = src_shader->info.cs.local_size[0];<br>
> -   prog_data->local_size[1] = src_shader->info.cs.local_size[1];<br>
> -   prog_data->local_size[2] = src_shader->info.cs.local_size[2];<br>
> -   unsigned local_workgroup_size =<br>
> -      src_shader->info.cs.local_size[0] * src_shader->info.cs.local_size[1] *<br>
> -      src_shader->info.cs.local_size[2];<br>
> -<br>
> -   unsigned min_dispatch_width =<br>
> -      DIV_ROUND_UP(local_workgroup_size, compiler->devinfo->max_cs_threads);<br>
> -   min_dispatch_width = MAX2(8, min_dispatch_width);<br>
> -   min_dispatch_width = util_next_power_of_two(min_dispatch_width);<br>
> -   assert(min_dispatch_width <= 32);<br>
> +   unsigned min_dispatch_width;<br>
> +<br>
> +   if (!src_shader->info.cs.local_size_variable) {<br>
> +      unsigned local_workgroup_size =<br>
> +         src_shader->info.cs.local_size[0] * src_shader->info.cs.local_size[1] *<br>
> +         src_shader->info.cs.local_size[2];<br>
> +<br>
> +      min_dispatch_width =<br>
> +         DIV_ROUND_UP(local_workgroup_size, compiler->devinfo->max_cs_threads);<br>
> +      min_dispatch_width = MAX2(8, min_dispatch_width);<br>
> +      min_dispatch_width = util_next_power_of_two(min_dispatch_width);<br>
> +      assert(min_dispatch_width <= 32);<br>
> +<br>
> +      prog_data->local_size[0] = src_shader->info.cs.local_size[0];<br>
> +      prog_data->local_size[1] = src_shader->info.cs.local_size[1];<br>
> +      prog_data->local_size[2] = src_shader->info.cs.local_size[2];<br>
> +      prog_data->uses_variable_group_size = false;<br>
> +   } else {<br>
> +      /*<br>
> +       * If the local work group size is variable we have to use a dispatch<br>
> +       * width of 32 here, since at this point we don't know the actual size of<br>
> +       * the workload.<br>
> +       */<br>
> +      min_dispatch_width = 32;<br>
<br>
Is that a good idea? You are able to specify a different maximum when<br>
using a variable size (MAX_COMPUTE_VARIABLE_GROUP_INVOCATIONS_ARB)<br>
s.t. this is 16 (or even 8, although that may be too few for practical<br>
use) -- that way you would just set the max to 768 or whatever on<br>
gen8+.<br></blockquote><div><br></div><div>That's a good point, MAX_COMPUTE_VARIABLE_GROUP_INVOCATIONS_ARB is the</div><div>same on all platforms, so it makes sense to use simd16 instead. Thank you for noticing that.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> +      prog_data->uses_variable_group_size = true;<br>
> +   }<br>
><br>
>     fs_visitor *v8 = NULL, *v16 = NULL, *v32 = NULL;<br>
>     cfg_t *cfg = NULL;<br>
<br>
As for the rest of it, I don't know enough, but you seem to be doing a<br>
lot of divisions and mods in the shader. These tend to be expensive<br>
ops -- I wonder if there's a way to alleviate some of that.<br></blockquote><div><br></div><div>That's true, unfortunately I think doing these calculations in the shader is</div><div>necessary. They all use the local group size which, when it's variable, is not</div><div>available until the dispatch command is issued, I couldn't think of a way around</div><div>that :(</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
  -ilia<br>
</blockquote></div></div>