[PATCH 1/1] drm/amdkfd: Do not ignore requested queue size during allocation

Felix Kuehling felix.kuehling at amd.com
Fri Dec 1 19:37:30 UTC 2017


On 2017-11-30 06:51 PM, Jan Vesely wrote:
>
> It's not a userspace queue that stops. I'm using kernel dbgdev to issue
> wave_resume commands. (waves are halted after executing
> s_sendmsg_halt).
> I bumped KFD_KERNEL_QUEUE_SIZE to 16KB to make sure all 320 resume
> commands fit (otherwise I get spurious ENOMEM when the queue is full but
> still advancing).

Sorry, didn't see this part of your message before.

To see the actual state of the DIQ in the hardware, you should look at
the HQD. You can find the matching HQD by looking at the queue base
address (cp_hqd_pq_base) which is at offset 0x220 in the MQD and offset
0xc934 in the register space (HQD).
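
A minimal sketch of that lookup in C, assuming the MQD and the
cp_hqd_pq_base value of each HQD slot have already been dumped into
memory. The helper names and the dump layout are hypothetical; only the
offsets 0x220 and 0xc934 come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offset of cp_hqd_pq_base inside the MQD, as stated above. */
#define MQD_CP_HQD_PQ_BASE_OFFSET 0x220

/* Hypothetical helper: read the 32-bit cp_hqd_pq_base field from an MQD dump. */
static uint32_t mqd_pq_base(const uint8_t *mqd)
{
	uint32_t v;

	assert(mqd != NULL);
	memcpy(&v, mqd + MQD_CP_HQD_PQ_BASE_OFFSET, sizeof(v));
	return v;
}

/*
 * Hypothetical helper: hqd_pq_base[i] holds the cp_hqd_pq_base register
 * value (register offset 0xc934) read back for HQD slot i.
 * Returns the index of the matching slot, or -1 if no slot matches.
 */
static int find_matching_hqd(const uint8_t *mqd,
			     const uint32_t *hqd_pq_base, size_t n)
{
	uint32_t want = mqd_pq_base(mqd);

	for (size_t i = 0; i < n; i++)
		if (hqd_pq_base[i] == want)
			return (int)i;
	return -1;
}
```

How the HQD register values are read back (debugfs, register dump tool,
etc.) is left open here, since it depends on the kernel in use.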

I've previously debugged some obscure CP hangs involving the DIQ and wave
control commands, which required help from the firmware team. The fix was
to remove synchronization with release_mem packets that could hang in
combination with wave control. It turned out the synchronization wasn't
really needed anyway. But it had some implications for how memory was
managed. I had to add code to allocate the IB on the queue (using a NOP
command), so I wouldn't have to free it explicitly (which would require
synchronization). I think that code is still not 100% correct. When the
queue is nearly full, an IB may get overwritten. I'd have to restructure
the code to allocate the IB after the commands that submit the IB, so
that the IB can't get overwritten until after the IB execution is finished.
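
The ordering problem can be illustrated with a toy ring-buffer model in
C. This is only a sketch under simplifying assumptions, not the kfd
code: the ring size, the packet sizes and the helper name are made up,
and it assumes the read pointer sits at the packet that submits the IB
while the IB is executing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_DWORDS 64u	/* hypothetical ring size, must be a power of two */

/*
 * True if dword offset `off` lies in the region the writer is allowed to
 * reuse, i.e. between wptr and rptr (modulo the ring size). One slot is
 * kept unused so that rptr == wptr always means "empty".
 */
static bool writer_may_overwrite(uint32_t rptr, uint32_t wptr, uint32_t off)
{
	uint32_t mask = RING_DWORDS - 1;
	uint32_t free_dwords = (rptr - wptr - 1) & mask;
	uint32_t dist = (off - wptr) & mask;

	assert(off < RING_DWORDS);
	return dist < free_dwords;
}

/*
 * Layout A (IB first):  NOP header at 0, IB payload at 1..8,
 *                       submit packet at 9..12, wptr = 13.
 * Layout B (IB last):   submit packet at 0..3, NOP header at 4,
 *                       IB payload at 5..12, wptr = 13.
 */
```

In layout A, with rptr at the submit packet (9), the call
writer_may_overwrite(9, 13, 1) returns true: the IB payload is already
behind the read pointer, so a nearly full queue can wrap around and
clobber it. In layout B, with rptr at the submit packet (0),
writer_may_overwrite(0, 13, 5) returns false, so the payload stays
protected until the read pointer has moved past it.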

Regards,
  Felix

>
> thanks,
> Jan
>
>> Regards,
>>   Felix
>>
>>
>> On 2017-11-29 04:43 PM, Jan Vesely wrote:
>>> On Mon, 2017-11-20 at 14:22 -0500, Felix Kuehling wrote:
>>>> I think this patch is not correct. The EOP-mem is not associated with
>>>> the queue size. The EOP buffer is a separate buffer used by the firmware
>>>> to handle command completion. As I understand it, this allows more
>>>> concurrency, while still making it look like all commands in the queue
>>>> are completing in order.
>>> thanks for the explanation. I was looking for a source of a CP hang
>>> (rptr stops advancing), but bumping the eop size actually made things
>>> worse. Is there a way to find out if a queue got disabled and for what
>>> reason? (I'm running ROCK-1.6.x based kernel)
>>>
>>> thanks,
>>> Jan
>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>>
>>>> On 2017-11-19 03:19 AM, Oded Gabbay wrote:
>>>>> On Thu, Nov 16, 2017 at 11:36 PM, Jan Vesely <jan.vesely at rutgers.edu> wrote:
>>>>>> Signed-off-by: Jan Vesely <jan.vesely at rutgers.edu>
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c | 5 +++--
>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
>>>>>> index f1d48281e322..b3bee39661ab 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
>>>>>> @@ -37,15 +37,16 @@ static bool initialize_vi(struct kernel_queue *kq, struct kfd_dev *dev,
>>>>>>                         enum kfd_queue_type type, unsigned int queue_size)
>>>>>>  {
>>>>>>         int retval;
>>>>>> +       unsigned int size = ALIGN(queue_size, PAGE_SIZE);
>>>>>>
>>>>>> -       retval = kfd_gtt_sa_allocate(dev, PAGE_SIZE, &kq->eop_mem);
>>>>>> +       retval = kfd_gtt_sa_allocate(dev, size, &kq->eop_mem);
>>>>>>         if (retval != 0)
>>>>>>                 return false;
>>>>>>
>>>>>>         kq->eop_gpu_addr = kq->eop_mem->gpu_addr;
>>>>>>         kq->eop_kernel_addr = kq->eop_mem->cpu_ptr;
>>>>>>
>>>>>> -       memset(kq->eop_kernel_addr, 0, PAGE_SIZE);
>>>>>> +       memset(kq->eop_kernel_addr, 0, size);
>>>>>>
>>>>>>         return true;
>>>>>>  }
>>>>>> --
>>>>>> 2.13.6
>>>>>>
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx at lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>> Thanks!
>>>>> Applied to -next tree
>>>>> Oded
>>