<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Tue, Jan 15, 2019 at 3:54 PM Axel Davy <<a href="mailto:davyaxel0@gmail.com">davyaxel0@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 15/01/2019 18:50, Marek Olšák wrote:<br>
> >
> > +void si_compute_copy_image(struct si_context *sctx,
> > +                           struct pipe_resource *dst,
> > +                           unsigned dst_level,
> > +                           struct pipe_resource *src,
> > +                           unsigned src_level,
> > +                           unsigned dstx, unsigned dsty, unsigned dstz,
> > +                           const struct pipe_box *src_box)
> > +{
> > +        struct pipe_context *ctx = &sctx->b;
> > +        unsigned width = src_box->width;
> > +        unsigned height = src_box->height;
> > +        unsigned depth = src_box->depth;
> > +
> > +        unsigned data[] = {src_box->x, src_box->y, src_box->z, 0, dstx, dsty, dstz, 0};
> > +
> > +        if (width == 0 || height == 0)
> > +                return;
> > +
> > +        sctx->flags |= SI_CONTEXT_CS_PARTIAL_FLUSH |
> > +                       si_get_flush_flags(sctx, SI_COHERENCY_SHADER, L2_STREAM);
> > +        si_make_CB_shader_coherent(sctx, dst->nr_samples, true);
> > +
> > +        struct pipe_constant_buffer saved_cb = {};
> > +        si_get_pipe_constant_buffer(sctx, PIPE_SHADER_COMPUTE, 0, &saved_cb);
> > +
> > +        struct si_images *images = &sctx->images[PIPE_SHADER_COMPUTE];
> > +        struct pipe_image_view saved_image[2] = {0};
> > +        util_copy_image_view(&saved_image[0], &images->views[0]);
> > +        util_copy_image_view(&saved_image[1], &images->views[1]);
> > +
> > +        void *saved_cs = sctx->cs_shader_state.program;
> > +
> > +        struct pipe_constant_buffer cb = {};
> > +        cb.buffer_size = sizeof(data);
> > +        cb.user_buffer = data;
> > +        ctx->set_constant_buffer(ctx, PIPE_SHADER_COMPUTE, 0, &cb);
> > +
> > +        struct pipe_image_view image[2] = {0};
> > +        image[0].resource = src;
> > +        image[0].shader_access = image[0].access = PIPE_IMAGE_ACCESS_READ;
> > +        image[0].format = util_format_linear(src->format);
> > +        image[0].u.tex.level = src_level;
> > +        image[0].u.tex.first_layer = 0;
> > +        image[0].u.tex.last_layer =
> > +                src->target == PIPE_TEXTURE_3D ? u_minify(src->depth0, src_level) - 1
> > +                                               : (unsigned)(src->array_size - 1);
> > +        image[1].resource = dst;
> > +        image[1].shader_access = image[1].access = PIPE_IMAGE_ACCESS_WRITE;
> > +        image[1].format = util_format_linear(dst->format);
> > +        image[1].u.tex.level = dst_level;
> > +        image[1].u.tex.first_layer = 0;
> > +        image[1].u.tex.last_layer =
> > +                dst->target == PIPE_TEXTURE_3D ? u_minify(dst->depth0, dst_level) - 1
> > +                                               : (unsigned)(dst->array_size - 1);
> > +
> > +        if (src->format == PIPE_FORMAT_R9G9B9E5_FLOAT)
> > +                image[0].format = image[1].format = PIPE_FORMAT_R32_UINT;
> > +
> > +        /* SNORM8 blitting has precision issues on some chips. Use the SINT
> > +         * equivalent instead, which doesn't force DCC decompression.
> > +         * Note that some chips avoid this issue by using SDMA.
> > +         */
> > +        if (util_format_is_snorm8(dst->format)) {
> > +                image[0].format = image[1].format =
> > +                        util_format_snorm8_to_sint8(dst->format);
> > +        }
> > +
> > +        ctx->set_shader_images(ctx, PIPE_SHADER_COMPUTE, 0, 2, image);
> > +
> > +        struct pipe_grid_info info = {0};
> > +
> > +        if (dst->target == PIPE_TEXTURE_1D_ARRAY && src->target == PIPE_TEXTURE_1D_ARRAY) {
> > +                if (!sctx->cs_copy_image_1d_array)
> > +                        sctx->cs_copy_image_1d_array =
> > +                                si_create_copy_image_compute_shader_1d_array(ctx);
> > +                ctx->bind_compute_state(ctx, sctx->cs_copy_image_1d_array);
> > +                info.block[0] = 64;
> > +                info.last_block[0] = width % 64;
> > +                info.block[1] = 1;
> > +                info.block[2] = 1;
> > +                info.grid[0] = DIV_ROUND_UP(width, 64);
> > +                info.grid[1] = depth;
> > +                info.grid[2] = 1;
> > +        } else {
> > +                if (!sctx->cs_copy_image)
> > +                        sctx->cs_copy_image = si_create_copy_image_compute_shader(ctx);
> > +                ctx->bind_compute_state(ctx, sctx->cs_copy_image);
> > +                info.block[0] = 8;
> > +                info.last_block[0] = width % 8;
> > +                info.block[1] = 8;
> > +                info.last_block[1] = height % 8;
> > +                info.block[2] = 1;
> > +                info.grid[0] = DIV_ROUND_UP(width, 8);
> > +                info.grid[1] = DIV_ROUND_UP(height, 8);
> > +                info.grid[2] = depth;
> > +        }
> > +
> > +        ctx->launch_grid(ctx, &info);
> > +
> > +        sctx->flags |= SI_CONTEXT_CS_PARTIAL_FLUSH |
> > +                       (sctx->chip_class <= VI ? SI_CONTEXT_WRITEBACK_GLOBAL_L2 : 0) |
> > +                       si_get_flush_flags(sctx, SI_COHERENCY_SHADER, L2_STREAM);
> > +        ctx->bind_compute_state(ctx, saved_cs);
> > +        ctx->set_shader_images(ctx, PIPE_SHADER_COMPUTE, 0, 2, saved_image);
> > +        ctx->set_constant_buffer(ctx, PIPE_SHADER_COMPUTE, 0, &saved_cb);
> > +}
> > +
>
> > +void *si_create_copy_image_compute_shader(struct pipe_context *ctx)
> > +{
> > +        static const char text[] =
> > +                "COMP\n"
> > +                "PROPERTY CS_FIXED_BLOCK_WIDTH 8\n"
> > +                "PROPERTY CS_FIXED_BLOCK_HEIGHT 8\n"
> > +                "PROPERTY CS_FIXED_BLOCK_DEPTH 1\n"
> > +                "DCL SV[0], THREAD_ID\n"
> > +                "DCL SV[1], BLOCK_ID\n"
> > +                "DCL IMAGE[0], 2D_ARRAY, PIPE_FORMAT_R32G32B32A32_FLOAT, WR\n"
> > +                "DCL IMAGE[1], 2D_ARRAY, PIPE_FORMAT_R32G32B32A32_FLOAT, WR\n"
> > +                "DCL CONST[0][0..1]\n" // 0:xyzw 1:xyzw
> > +                "DCL TEMP[0..4], LOCAL\n"
> > +                "IMM[0] UINT32 {8, 1, 0, 0}\n"
> > +                "MOV TEMP[0].xyz, CONST[0][0].xyzw\n"
> > +                "UMAD TEMP[1].xyz, SV[1].xyzz, IMM[0].xxyy, SV[0].xyzz\n"
> > +                "UADD TEMP[2].xyz, TEMP[1].xyzx, TEMP[0].xyzx\n"
> > +                "LOAD TEMP[3], IMAGE[0], TEMP[2].xyzx, 2D_ARRAY, PIPE_FORMAT_R32G32B32A32_FLOAT\n"
> > +                "MOV TEMP[4].xyz, CONST[0][1].xyzw\n"
> > +                "UADD TEMP[2].xyz, TEMP[1].xyzx, TEMP[4].xyzx\n"
> > +                "STORE IMAGE[1], TEMP[2].xyzz, TEMP[3], 2D_ARRAY, PIPE_FORMAT_R32G32B32A32_FLOAT\n"
> > +                "END\n";
> > +
> > +        struct tgsi_token tokens[1024];
> > +        struct pipe_compute_state state = {0};
> > +
> > +        if (!tgsi_text_translate(text, tokens, ARRAY_SIZE(tokens))) {
> > +                assert(false);
> > +                return NULL;
> > +        }
> > +
> > +        state.ir_type = PIPE_SHADER_IR_TGSI;
> > +        state.prog = tokens;
> > +
> > +        return ctx->create_compute_state(ctx, &state);
> > +}
> > +
> >
> Hi,
>
> Here is my summary of my understanding of the proposed copy
> implementation:
>
> . Store the input and output (x, y, z) offsets in a constant buffer.
> . Dispatch (8, 8) workgroups.
> . Each workitem copies the pixel at (x+get_group_id(0)*8+get_local_id(0),
>   y+get_group_id(1)*8+get_local_id(1),
>   z+get_group_id(2)*8+get_local_id(2)). The pixel is RGBA.

The block size in Z is 1.
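To spell out the indexing: each thread of the shader above effectively does
the following (a plain-C sketch only; image_load/image_store are hypothetical
stand-ins for the TGSI LOAD/STORE opcodes, not real Gallium functions):

#include <stdint.h>

/* Hypothetical stand-ins for the TGSI LOAD/STORE image opcodes. */
void image_load(int image, uint32_t x, uint32_t y, uint32_t z, float texel[4]);
void image_store(int image, uint32_t x, uint32_t y, uint32_t z,
                 const float texel[4]);

/* What one thread of the 8x8x1 copy shader does; src_off and dst_off are
 * CONST[0][0].xyz and CONST[0][1].xyz set up by si_compute_copy_image. */
static void copy_one_texel(const uint32_t src_off[3], const uint32_t dst_off[3],
                           const uint32_t block_id[3], const uint32_t thread_id[3])
{
        /* UMAD: coordinate of this thread inside the copied box */
        uint32_t x = block_id[0] * 8 + thread_id[0];
        uint32_t y = block_id[1] * 8 + thread_id[1];
        uint32_t z = block_id[2] * 1 + thread_id[2];
        float texel[4];

        /* UADD + LOAD: read the source texel from IMAGE[0] */
        image_load(0, x + src_off[0], y + src_off[1], z + src_off[2], texel);

        /* UADD + STORE: write it to IMAGE[1] */
        image_store(1, x + dst_off[0], y + dst_off[1], z + dst_off[2], texel);
}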
>
> Some questions:
> . What happens when the textures do not have some components? R32F, for
>   example.

Components that don't exist are not stored.
> . I'm not familiar with using images in compute shaders, but is it OK to
>   declare them as ARGB32F even if the input/output data is not float?

The driver ignores the format specified by shaders.
>
> Some comments:
>
> . If src_x, dstx, etc. are not multiples of (8, 8), the workgroups won't
>   be aligned well with the tiling pattern. Fortunately the cache should
>   mitigate the loss, but if that's an important case to handle, one could
>   write the shader differently so that all workgroups (except at the
>   border) are aligned. I guess one could benchmark to see whether that
>   tiling alignment matters much here.

That would complicate the shaders too much.
> . Overhead can be reduced by copying several pixels per work-item.

Patches welcome. :)
> . If the src and dst regions are perfectly aligned with the tiling
>   pattern, the copy can be reduced to just moving a rectangle of memory
>   (no tiling). It could be implemented with dma_copy if no conversion is
>   needed, or with a shader using buffers (no images), which would avoid
>   using the image sampling hw; I believe that can be more limiting than
>   sampling a buffer when there are a lot of wavefronts. The data
>   conversion can be done at no cost in the shader, as it should be
>   memory bound.

Too complicated.
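For reference, the precondition for that fast path would look roughly like
this (an illustrative sketch only; tile_w/tile_h stand for the tile
dimensions of the surface layout, and it ignores further constraints such as
matching tile modes, bpp and DCC state):

#include <stdbool.h>

/* Sketch of the tile-alignment check: both regions start on a tile
 * boundary and span whole tiles, so the copy degenerates to moving
 * untiled blocks of memory. */
static bool copy_is_tile_aligned(unsigned src_x, unsigned src_y,
                                 unsigned dst_x, unsigned dst_y,
                                 unsigned width, unsigned height,
                                 unsigned tile_w, unsigned tile_h)
{
        return src_x % tile_w == 0 && src_y % tile_h == 0 &&
               dst_x % tile_w == 0 && dst_y % tile_h == 0 &&
               width % tile_w == 0 && height % tile_h == 0;
}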
> . (8, 8) is not optimal for linearly tiled images (but I guess we don't
>   often get to use them with resource_copy_region).

Yes, linear -> linear copies are slower. Tiled <-> linear copies wouldn't be
improved.

Marek