<div dir="auto"><div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Aug 1, 2024, 00:28 Khatri, Sunil <<a href="mailto:sukhatri@amd.com">sukhatri@amd.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br> On 8/1/2024 8:49 AM, Marek Olšák wrote:<br> > On Tue, Jul 30, 2024 at 8:43 AM Sunil Khatri <<a href="mailto:sunil.khatri@amd.com" target="_blank" rel="noreferrer">sunil.khatri@amd.com</a>> wrote:<br> >> Adding NOP packets one by one in the ring<br> >> does not use the CP efficiently.<br> >><br> >> Solution:<br> >> Use CP optimization while adding NOP packet's so PFP<br> >> can discard NOP packets based on information of count<br> >> from the Header instead of fetching all NOP packets<br> >> one by one.<br> >><br> >> Cc: Christian König <<a href="mailto:christian.koenig@amd.com" target="_blank" rel="noreferrer">christian.koenig@amd.com</a>><br> >> Cc: Pierre-Eric Pelloux-Prayer <<a href="mailto:pierre-eric.pelloux-prayer@amd.com" target="_blank" rel="noreferrer">pierre-eric.pelloux-prayer@amd.com</a>><br> >> Cc: Tvrtko Ursulin <<a href="mailto:tursulin@igalia.com" target="_blank" rel="noreferrer">tursulin@igalia.com</a>><br> >> Cc: Marek Olšák <<a href="mailto:marek.olsak@amd.com" target="_blank" rel="noreferrer">marek.olsak@amd.com</a>><br> >> Signed-off-by: Sunil Khatri <<a href="mailto:sunil.khatri@amd.com" target="_blank" rel="noreferrer">sunil.khatri@amd.com</a>><br> >> ---<br> >> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 24 +++++++++++++++++++++---<br> >> 1 file changed, 21 insertions(+), 3 deletions(-)<br> >><br> >> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c<br> >> index 853084a2ce7f..edf5b5c4d185 100644<br> >> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c<br> >> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c<br> >> @@ -9397,6 +9397,24 @@ static void gfx_v10_0_emit_mem_sync(struct amdgpu_ring *ring)<br> >> amdgpu_ring_write(ring, gcr_cntl); /* GCR_CNTL */<br> >> }<br> >><br> >> +static void amdgpu_gfx10_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)<br> >> +{<br> >> + int i;<br> >> +<br> >> + /* Header itself is a NOP packet */<br> >> + if (num_nop == 1) {<br> >> + amdgpu_ring_write(ring, ring->funcs->nop);<br> >> + return;<br> >> + }<br> >> +<br> >> + /* Max HW optimization till 0x3ffe, followed by remaining one NOP at a time*/<br> >> + amdgpu_ring_write(ring, PACKET3(PACKET3_NOP, min(num_nop - 2, 0x3ffe)));<br> >> +<br> >> + /* Header is at index 0, followed by num_nops - 1 NOP packet's */<br> >> + for (i = 1; i < num_nop; i++)<br> >> + amdgpu_ring_write(ring, ring->funcs->nop);<br> > This loop should be removed. It's unnecessary CPU overhead and we<br> > should never get more than 0x3fff NOPs (maybe use BUG_ON). Leaving the<br> > whole packet body uninitialized is the fastest option.<br> That was the original intent to just move the WPTR for the no of nops <br> and tried too. Based on Christian inputs we should not let the nops packet<br> <br> as garbage or whatever was there originally as a threat/safety measure.</blockquote></div></div><div dir="auto"><br></div><div dir="auto">It doesn't help safety. It can only be read by the GPU with kernel-level permissions.</div><div dir="auto"><br></div><div dir="auto">Initializing the packet body is useless and adds CPU overhead, especially with the 256 NOPs or so that we use for no reason.</div><div dir="auto"><br></div><div dir="auto">Marek</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> So CPU side there isnt any optimization but just CP will skip all these <br> so GPU side should see the benefit.<br> <br> Regards<br> Sunil Khatri<br> <br> ><br> > Marek<br> ><br> >> +}<br> >> +<br> >> static void gfx_v10_ip_print(void *handle, struct drm_printer *p)<br> >> {<br> >> struct amdgpu_device *adev = (struct amdgpu_device *)handle;<br> >> @@ -9588,7 +9606,7 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_gfx = {<br> >> .emit_hdp_flush = gfx_v10_0_ring_emit_hdp_flush,<br> >> .test_ring = gfx_v10_0_ring_test_ring,<br> >> .test_ib = gfx_v10_0_ring_test_ib,<br> >> - .insert_nop = amdgpu_ring_insert_nop,<br> >> + .insert_nop = amdgpu_gfx10_ring_insert_nop,<br> >> .pad_ib = amdgpu_ring_generic_pad_ib,<br> >> .emit_switch_buffer = gfx_v10_0_ring_emit_sb,<br> >> .emit_cntxcntl = gfx_v10_0_ring_emit_cntxcntl,<br> >> @@ -9629,7 +9647,7 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_compute = {<br> >> .emit_hdp_flush = gfx_v10_0_ring_emit_hdp_flush,<br> >> .test_ring = gfx_v10_0_ring_test_ring,<br> >> .test_ib = gfx_v10_0_ring_test_ib,<br> >> - .insert_nop = amdgpu_ring_insert_nop,<br> >> + .insert_nop = amdgpu_gfx10_ring_insert_nop,<br> >> .pad_ib = amdgpu_ring_generic_pad_ib,<br> >> .emit_wreg = gfx_v10_0_ring_emit_wreg,<br> >> .emit_reg_wait = gfx_v10_0_ring_emit_reg_wait,<br> >> @@ -9659,7 +9677,7 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_kiq = {<br> >> .emit_fence = gfx_v10_0_ring_emit_fence_kiq,<br> >> .test_ring = gfx_v10_0_ring_test_ring,<br> >> .test_ib = gfx_v10_0_ring_test_ib,<br> >> - .insert_nop = amdgpu_ring_insert_nop,<br> >> + .insert_nop = amdgpu_gfx10_ring_insert_nop,<br> >> .pad_ib = amdgpu_ring_generic_pad_ib,<br> >> .emit_rreg = gfx_v10_0_ring_emit_rreg,<br> >> .emit_wreg = gfx_v10_0_ring_emit_wreg,<br> >> --<br> >> 2.34.1<br> >><br> </blockquote></div></div></div>