[PATCH 1/1] drm/amdkfd: Do not ignore requested queue size during allocation

Jan Vesely jan.vesely at rutgers.edu
Thu Nov 30 23:51:29 UTC 2017


On Wed, 2017-11-29 at 16:58 -0500, Felix Kuehling wrote:
> You can see the state of the queues in debugfs:
> /sys/kernel/debug/kfd/... You can look at MQDs and HQDs.

Thanks. How do I decode the information?
The rptr always stops at position 60, which looks like this in mqds:

 DIQ on device 45a2
    00000000: c0310800 00004000 00000000 00000000 00000000 00000000 00000000 00000000
    00000020: 00000000 00000000 00000000 00000001 00000000 00000000 00000000 00000000
    00000040: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ffffffff
    00000060: ffffffff 00000000 ffffffff ffffffff 00000000 00000000 00000000 00000000

If I understood correctly, that's the queue dump, so those ffffffff
values look wrong.
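
For reading such a dump by offset, a minimal sketch in userspace C follows.
It assumes only the line layout visible in the listing above (a hex byte
offset, a colon, then up to eight 32-bit hex words). The names
parse_dump_line and count_all_ones are hypothetical helpers, not part of
KFD, and whether "position 60" means a byte or a dword offset is not
stated in this thread, so the sketch leaves that interpretation to the
reader.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DWORDS_PER_LINE 8

/*
 * Parse one dump line of the form "OFFSET: w0 w1 ... w7" (all hex).
 * Stores the byte offset in *offset and the words in words[].
 * Returns the number of words parsed, or -1 if the line does not
 * start with "OFFSET:" (for example the "DIQ on device" header).
 */
static int parse_dump_line(const char *line, uint32_t *offset,
                           uint32_t words[DWORDS_PER_LINE])
{
	char *end;
	int n = 0;

	assert(line && offset && words);

	/* strtoul skips the leading indentation for us. */
	*offset = (uint32_t)strtoul(line, &end, 16);
	if (end == line || *end != ':')
		return -1;

	line = end + 1;
	while (n < DWORDS_PER_LINE) {
		unsigned long v = strtoul(line, &end, 16);

		if (end == line)
			break;	/* no more hex words on this line */
		words[n++] = (uint32_t)v;
		line = end;
	}
	return n;
}

/* Count the words equal to 0xffffffff among the first n words. */
static int count_all_ones(const uint32_t *words, int n)
{
	int i, count = 0;

	for (i = 0; i < n; i++)
		if (words[i] == 0xffffffffu)
			count++;
	return count;
}
```

Word i of a parsed line sits at byte offset *offset + 4 * i, so the
all-ones words in the last line of the listing are at byte offsets 0x60,
0x68 and 0x6c.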

> 
> If your application isn't stopping queues deliberately, queues get
> disabled by evictions, usually temporarily. You'll see kernel messages
> when that happens.
> 
> A VM fault will result in queues of the offending process getting
> disabled permanently. Again, you'll see messages about that in the
> kernel log.
> 
> The RPTR can also stop advancing if you have an infinite loop in a
> shader program, or just a shader that takes a very long time to execute.
> Or maybe if you have some dependencies (barriers) in your AQL packets
> that never get satisfied.
> 
> The function you changed only affects the HIQ, the queue that KFD uses
> to control the HWS. It does not affect user mode queues. If your problem
> is with a user mode queue, your change should have no effect at all.

It's not a userspace queue that stops. I'm using the kernel dbgdev to
issue wave_resume commands (the waves are halted after executing
s_sendmsg_halt).
I bumped KFD_KERNEL_QUEUE_SIZE to 16KB to make sure all 320 resume
commands fit (otherwise I get spurious ENOMEM when the queue is full but
still advancing).
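
To illustrate the "full but still advancing" case, here is a minimal
sketch of generic ring-buffer space accounting in plain C. It is not
taken from the KFD sources: ring_free_dwords and ring_reserve are
hypothetical names, the one-slot-kept-empty convention is an assumption,
and the size of a single resume command is not given in this thread, so
the packet length is left as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Free dwords in a ring of ring_dwords entries. One slot is kept
 * empty so that rptr == wptr always means "empty" (assumed
 * convention). rptr and wptr are dword indices in [0, ring_dwords).
 */
static uint32_t ring_free_dwords(uint32_t rptr, uint32_t wptr,
                                 uint32_t ring_dwords)
{
	assert(ring_dwords > 1 && rptr < ring_dwords && wptr < ring_dwords);
	return (rptr + ring_dwords - 1 - wptr) % ring_dwords;
}

/*
 * Try to reserve packet_dwords in the ring. Returns 0 and advances
 * *wptr on success. Returns -1 and leaves *wptr untouched when the
 * packet does not fit right now; the caller may retry once the
 * consumer has advanced rptr.
 */
static int ring_reserve(uint32_t rptr, uint32_t *wptr,
                        uint32_t ring_dwords, uint32_t packet_dwords)
{
	if (packet_dwords > ring_free_dwords(rptr, *wptr, ring_dwords))
		return -1;
	*wptr = (*wptr + packet_dwords) % ring_dwords;
	return 0;
}
```

A 16KB ring holds 4096 dwords. With this accounting, a producer that
submits faster than the consumer drains sees a failed reservation even
though the queue is healthy, and the same request succeeds after rptr
moves forward.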

thanks,
Jan

> 
> Regards,
>   Felix
> 
> 
> On 2017-11-29 04:43 PM, Jan Vesely wrote:
> > On Mon, 2017-11-20 at 14:22 -0500, Felix Kuehling wrote:
> > > I think this patch is not correct. The EOP-mem is not associated with
> > > the queue size. The EOP buffer is a separate buffer used by the firmware
> > > to handle command completion. As I understand it, this allows more
> > > concurrency, while still making it look like all commands in the queue
> > > are completing in order.
> > 
> > Thanks for the explanation. I was looking for the source of a CP hang
> > (rptr stops advancing), but bumping the EOP size actually made things
> > worse. Is there a way to find out if a queue got disabled and for what
> > reason? (I'm running a ROCK-1.6.x based kernel.)
> > 
> > thanks,
> > Jan
> > 
> > > Regards,
> > >   Felix
> > > 
> > > 
> > > On 2017-11-19 03:19 AM, Oded Gabbay wrote:
> > > > On Thu, Nov 16, 2017 at 11:36 PM, Jan Vesely <jan.vesely at rutgers.edu> wrote:
> > > > > Signed-off-by: Jan Vesely <jan.vesely at rutgers.edu>
> > > > > ---
> > > > >  drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c | 5 +++--
> > > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
> > > > > index f1d48281e322..b3bee39661ab 100644
> > > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
> > > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
> > > > > @@ -37,15 +37,16 @@ static bool initialize_vi(struct kernel_queue *kq, struct kfd_dev *dev,
> > > > >                         enum kfd_queue_type type, unsigned int queue_size)
> > > > >  {
> > > > >         int retval;
> > > > > +       unsigned int size = ALIGN(queue_size, PAGE_SIZE);
> > > > > 
> > > > > -       retval = kfd_gtt_sa_allocate(dev, PAGE_SIZE, &kq->eop_mem);
> > > > > +       retval = kfd_gtt_sa_allocate(dev, size, &kq->eop_mem);
> > > > >         if (retval != 0)
> > > > >                 return false;
> > > > > 
> > > > >         kq->eop_gpu_addr = kq->eop_mem->gpu_addr;
> > > > >         kq->eop_kernel_addr = kq->eop_mem->cpu_ptr;
> > > > > 
> > > > > -       memset(kq->eop_kernel_addr, 0, PAGE_SIZE);
> > > > > +       memset(kq->eop_kernel_addr, 0, size);
> > > > > 
> > > > >         return true;
> > > > >  }
> > > > > --
> > > > > 2.13.6
> > > > > 
> > > > > _______________________________________________
> > > > > amd-gfx mailing list
> > > > > amd-gfx at lists.freedesktop.org
> > > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> > > > 
> > > > Thanks!
> > > > Applied to -next tree
> > > > Oded
> 
> 