[PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool

Mon Nov 11 00:06:24 UTC 2024

[AMD Official Use Only - AMD Internal Distribution Only]

>> All prior loads/stores are supposed to be globally visible. Hence mb() followed by a write pointer update also should guarantee the same (From Arch manual).

        The MFENCE instruction establishes a memory fence for both loads and stores. The processor ensures that no load or store after MFENCE will become globally visible *until all loads and stores before MFENCE are globally visible.*

<<

Lijo, two things I shall state here:

1) from our experience it is not true, the WC storage buffer from CPU is not flushed when mfence presents, like I said, we did many .. many experiment before submitting this patch
2) from Intel's CPU spec (Customer's platform is with Intel CPU), the WC storage doesn't belong to the cache nor to the regular CPU internal storage place. (I can help to get more literal and formal descriptions later for you for double confirm, this is what I heard from customer during on-site support)
3) without this patch, when the issues happens, we can see the "last header dump" is 0, and you can see in KFD code the HIQ 's PQ was previously cleared by "memset()" with 0 and later followed the PM4 fulfilling, so the thing is the 0 is fetched by MEC but the corrected PM4 was somehow stuck in WC storage buffer occasionally.

At last, from any theorical perspective the ring buffer should be snooped to guarantee the coherence between CPU and GPU, again, I don't see any reason not to honor this point for HIQ given the statues of all GFX queues, and even the user queues allocated by H-S-A runtime is also cache mapping.

Monk Liu | Cloud GPU & Virtualization | AMD

-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar at amd.com>
Sent: Friday, November 8, 2024 7:26 PM
To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Alex Deucher <alexdeucher at gmail.com>; Zhao, Victor <Victor.Zhao at amd.com>
Cc: amd-gfx at lists.freedesktop.org; Yang, Philip <Philip.Yang at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>
Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool

On 11/8/2024 4:29 PM, Liu, Monk wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> To be clear for the mb() approach: Even if we insert mb() in prior to amdgpu_ring_set_wptr(ring), GPU might still fetch stalled data from PQ due to USWC attributes.
>
Inserting an mb() doesn't cause WC buffer flush is a wrong assumption.

All prior loads/stores are supposed to be globally visible. Hence mb() followed by a write pointer update also should guarantee the same (From Arch manual).

        The MFENCE instruction establishes a memory fence for both loads and stores. The processor ensures that no load or store after MFENCE will become globally visible *until all loads and stores before MFENCE are globally visible.*

Ignoring the amdgpu driver part of it - if mb() is not working as expected as you claim that means something is wrong with the system.

USWC or WB for ring type may still be a separate discussion.

Thanks,
Lijo

> The issue here is not the re-ordering but the stalled PQ.
>
> Monk Liu | Cloud GPU & Virtualization | AMD
>
>
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Friday, November 8, 2024 6:22 PM
> To: Koenig, Christian <Christian.Koenig at amd.com>; Lazar, Lijo
> <Lijo.Lazar at amd.com>; Alex Deucher <alexdeucher at gmail.com>; Zhao,
> Victor <Victor.Zhao at amd.com>
> Cc: amd-gfx at lists.freedesktop.org; Yang, Philip <Philip.Yang at amd.com>;
> Kuehling, Felix <Felix.Kuehling at amd.com>
> Subject: RE: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and
> wb pool
>
> Christian/Lijo
>
> We verified all approaches, and we know what works and not works, obviously the mb() doesn't work.
>
> The way of mb() between set wptr_polling and kicking off doorbell is theoretically correct, and I'm not object to do so... but this won't resolve the issue we hit.
> First of all, USWC will have some chance that the data is still in CPU's WC storage place and not flushed to the memory and even with MB() get rid of re-ordering GPU might still have a chance to read stalled data from ring buffer as long as it is mapped USWC.
>
> This is why only cache plus snoop memory can get rid of inconsistence issues.
>
> Besides, do you know what's the rational to keep all GFX KCQ cache+snoop but only HIQ/KIQ from KFD configured to USWC ?
>
> For performance concern that looks to me always the second priority compared to "correct" especially under the case HIQ contributes very little to ROCM performance when switching to cache mapping.
>
>
> Monk Liu | Cloud GPU & Virtualization | AMD
>
>
>
>
>
>
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig at amd.com>
> Sent: Thursday, November 7, 2024 7:57 PM
> To: Lazar, Lijo <Lijo.Lazar at amd.com>; Alex Deucher
> <alexdeucher at gmail.com>; Zhao, Victor <Victor.Zhao at amd.com>
> Cc: amd-gfx at lists.freedesktop.org; Liu, Monk <Monk.Liu at amd.com>; Yang,
> Philip <Philip.Yang at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>
> Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and
> wb pool
>
> Am 07.11.24 um 06:58 schrieb Lazar, Lijo:
>> On 11/6/2024 8:42 PM, Alex Deucher wrote:
>>> On Wed, Nov 6, 2024 at 1:49 AM Victor Zhao <Victor.Zhao at amd.com> wrote:
>>>> From: Monk Liu <Monk.Liu at amd.com>
>>>>
>>>> As cache GTT buffer is snooped, this way the coherence between CPU
>>>> write and GPU fetch is guaranteed, but original code uses WC +
>>>> unsnooped for HIQ PQ(ring buffer) which introduces coherency issues:
>>>> MEC fetches a stall data from PQ and leads to MEC hang.
>>> Can you elaborate on this?  I can see CPU reads being slower because
>>> the memory is uncached, but the ring buffer is mostly writes anyway.
>>> IIRC, the driver uses USWC for most if not all of the other ring
>>> buffers managed by the kernel.  Why aren't those a problem?
>> We have this on other rings -
>>          mb();
>>          amdgpu_ring_set_wptr(ring);
>>
>> I think the solution should be to use barrier before write pointer
>> updates rather than relying on PCIe snooping.
>
> Yeah, completely agree as well. The barrier also takes care of preventing the compiler from re-ordering writes.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>> Lijo
>>
>>> Alex
>>>
>>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> index 1f1d79ac5e6c..fb087a0ff5bc 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> @@ -779,7 +779,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>          if (amdgpu_amdkfd_alloc_gtt_mem(
>>>>                          kfd->adev, size, &kfd->gtt_mem,
>>>>                          &kfd->gtt_start_gpu_addr, &kfd->gtt_start_cpu_ptr,
>>>> -                       false, true)) {
>>>> +                       false, false)) {
>>>>                  dev_err(kfd_device, "Could not allocate %d bytes\n", size);
>>>>                  goto alloc_gtt_mem_failure;
>>>>          }
>>>> --
>>>> 2.34.1
>>>>
>