[PATCH] drm/amdgpu: optimize the padding with hw optimization

Sun Aug 4 05:28:46 UTC 2024

On Fri, Aug 2, 2024 at 6:10 AM Lazar, Lijo <lijo.lazar at amd.com> wrote:
>
>
>
> On 8/2/2024 12:25 AM, Marek Olšák wrote:
> > On Thu, Aug 1, 2024, 03:37 Christian König <christian.koenig at amd.com
> > <mailto:christian.koenig at amd.com>> wrote:
> >
> >     __
> >     Am 01.08.24 um 08:53 schrieb Marek Olšák:
> >>     On Thu, Aug 1, 2024, 00:28 Khatri, Sunil <sukhatri at amd.com
> >>     <mailto:sukhatri at amd.com>> wrote:
> >>
> >>
> >>         On 8/1/2024 8:49 AM, Marek Olšák wrote:
> >>         >> +       /* Header is at index 0, followed by num_nops - 1
> >>         NOP packet's */
> >>         >> +       for (i = 1; i < num_nop; i++)
> >>         >> +               amdgpu_ring_write(ring, ring->funcs->nop);
> >>         > This loop should be removed. It's unnecessary CPU overhead
> >>         and we
> >>         > should never get more than 0x3fff NOPs (maybe use BUG_ON).
> >>         Leaving the
> >>         > whole packet body uninitialized is the fastest option.
> >>         That was the original intent to just move the WPTR for the no
> >>         of nops
> >>         and tried too. Based on Christian inputs we should not let the
> >>         nops packet
> >>
> >>         as garbage or whatever was there originally as a threat/safety
> >>         measure.
> >>
> >>
> >>     It doesn't help safety. It can only be read by the GPU with
> >>     kernel-level permissions.
> >>
> >>     Initializing the packet body is useless and adds CPU overhead,
> >>     especially with the 256 NOPs or so that we use for no reason.
> >
> >     Not filling the remaining ring buffers with NOPs is a pretty clear
> >     NAK from my side. Leaving garbage in the ring buffer is not even
> >     remotely defensive.
> >
> >
> > What are you defending against? You know the ring is kernel-owned
> > memory, right?
> >
>
> Aside from that, the true hardware behavior is that CP still fetches the
> words and discards them. It's not the same mentioned in the description.
> So the only optimization it allows is to move the pointer without
> filling/caring about the contents as hardware also doesn't care about
> them. The notion of filling those unused region is exactly opposite of
> the intention. If that's the case, nothing is gained and just drop these
> patches.

It's correct that it doesn't reduce fetching, but this optimization is
not about fetching. It's about reducing the number of instructions
that the firmware must execute. Single dword NOPs are quite expensive
for their size.

Marek