[PATCH v2] drm/amdkfd: Fix EXT_COHERENT memory allocation crash

Francis, David David.Francis at amd.com
Wed Oct 4 18:21:11 UTC 2023



On 2023-10-03 17:37, Felix Kuehling wrote:
On 2023-10-03 16:50, Philip Yang wrote:
If there is no VRAM domain, bo_node is NULL and this causes a crash.
Refactor the change, and give the module parameter higher priority.

Another patch is needed to support overriding the PTE flag on APUs.

Fixes: 55d7e2001c7e ("drm/amdgpu: Add EXT_COHERENT memory allocation flags")
Signed-off-by: Philip Yang <Philip.Yang at amd.com>
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 18 +++++++-----------
  1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 0d88698ae33f..305b2c54edfa 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1248,26 +1248,22 @@ svm_range_get_pte_flags(struct kfd_node *node,
              break;
      case IP_VERSION(9, 4, 3):
              mtype_local = amdgpu_mtype_local == 1 ? AMDGPU_VM_MTYPE_NC :
-                          (amdgpu_mtype_local == 2 ? AMDGPU_VM_MTYPE_CC : AMDGPU_VM_MTYPE_RW);
+                           (amdgpu_mtype_local == 2 || ext_coherent ?
+                                     AMDGPU_VM_MTYPE_CC : AMDGPU_VM_MTYPE_RW);

We had some offline discussion where I thought that MTYPE_NC should
become MTYPE_UC when ext_coherent is enabled to get the desired memory
semantics. With that idea in mind, this would become a bit more messy,
but here it goes, as clean as I can make it:

-               mtype_local = amdgpu_mtype_local == 1 ? AMDGPU_VM_MTYPE_NC :
-                            (amdgpu_mtype_local == 2 ? AMDGPU_VM_MTYPE_CC : AMDGPU_VM_MTYPE_RW);
+               mtype_local = amdgpu_mtype_local == 1 && !ext_coherent ? AMDGPU_VM_MTYPE_NC :
+                            (amdgpu_mtype_local == 1 &&  ext_coherent ? AMDGPU_VM_MTYPE_UC :
+                            (amdgpu_mtype_local == 2 ||  ext_coherent ? AMDGPU_VM_MTYPE_CC :
+                                                                        AMDGPU_VM_MTYPE_RW));


That ternary looks fairly gnarly. I think it would be worth the extra ink to write

                mtype_local = amdgpu_mtype_local == 1 ? AMDGPU_VM_MTYPE_NC :
                             (amdgpu_mtype_local == 2 ? AMDGPU_VM_MTYPE_CC : AMDGPU_VM_MTYPE_RW);

                if (ext_coherent) {
                    if (amdgpu_mtype_local == 1)
                        mtype_local = AMDGPU_VM_MTYPE_UC;
                    else
                        mtype_local = AMDGPU_VM_MTYPE_CC;
                }
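
For anyone who wants to check the two-step form outside the driver, here is a minimal standalone sketch. The enum values and the helper name pick_mtype_local() are hypothetical stand-ins, not the kernel's AMDGPU_VM_MTYPE_* definitions; only the selection logic is meant to match the snippet above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the AMDGPU_VM_MTYPE_* flags; the real
 * values come from the amdgpu headers and are not reproduced here.
 */
enum mtype { MTYPE_RW, MTYPE_NC, MTYPE_CC, MTYPE_UC };

/* Hypothetical helper: amdgpu_mtype_local mirrors the module parameter. */
static enum mtype pick_mtype_local(int amdgpu_mtype_local, bool ext_coherent)
{
	enum mtype mtype_local;

	/* Default selection driven by the module parameter alone. */
	mtype_local = amdgpu_mtype_local == 1 ? MTYPE_NC :
		      (amdgpu_mtype_local == 2 ? MTYPE_CC : MTYPE_RW);

	/* ext_coherent upgrades NC to UC and everything else to CC. */
	if (ext_coherent) {
		if (amdgpu_mtype_local == 1)
			mtype_local = MTYPE_UC;
		else
			mtype_local = MTYPE_CC;
	}

	return mtype_local;
}
```

This should give the same result as the nested ternary quoted earlier for every combination of the two inputs, assuming I have read that ternary correctly.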

But maybe that could be fixed up in a follow up patch. Either way, for
the purpose of fixing the crash, this patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>


              snoop = true;
              if (uncached) {
                      mapping_flags |= AMDGPU_VM_MTYPE_UC;
-             } else if (ext_coherent) {
-                     /* local HBM region close to partition */
-                     if (bo_node->adev == node->adev &&
-                         (!bo_node->xcp || !node->xcp || bo_node->xcp->mem_id == node->xcp->mem_id))
-                             mapping_flags |= AMDGPU_VM_MTYPE_CC;
-                     else
-                             mapping_flags |= AMDGPU_VM_MTYPE_UC;
              } else if (domain == SVM_RANGE_VRAM_DOMAIN) {
                      /* local HBM region close to partition */
                      if (bo_node->adev == node->adev &&
                          (!bo_node->xcp || !node->xcp || bo_node->xcp->mem_id == node->xcp->mem_id))
                              mapping_flags |= mtype_local;
-                     /* local HBM region far from partition or remote XGMI GPU */
-                     else if (svm_nodes_in_same_hive(bo_node, node))
+                     /* local HBM region far from partition or remote XGMI GPU
+                      * with regular system scope coherence
+                      */
+                     else if (svm_nodes_in_same_hive(bo_node, node) && !ext_coherent)
                              mapping_flags |= AMDGPU_VM_MTYPE_NC;
-                     /* PCIe P2P */
+                     /* PCIe P2P or extended system scope coherence */
                      else
                              mapping_flags |= AMDGPU_VM_MTYPE_UC;

It would probably be clearer if these two branches were swapped so that the first condition was

(!svm_nodes_in_same_hive(bo_node, node) || ext_coherent)

Not a required change, though.
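
A rough sketch of what the swapped ordering could look like, reduced to plain booleans so it compiles on its own. The helper name pick_vram_mtype(), the local_close and same_hive parameters, and the enum values are all hypothetical; local_close stands for the "local HBM region close to partition" test and same_hive for svm_nodes_in_same_hive(bo_node, node).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the AMDGPU_VM_MTYPE_* flags. */
enum mtype { MTYPE_RW, MTYPE_NC, MTYPE_CC, MTYPE_UC };

/* Hypothetical helper modelling the SVM_RANGE_VRAM_DOMAIN branch. */
static enum mtype pick_vram_mtype(bool local_close, bool same_hive,
				  bool ext_coherent, enum mtype mtype_local)
{
	/* local HBM region close to partition */
	if (local_close)
		return mtype_local;

	/* PCIe P2P or extended system scope coherence */
	if (!same_hive || ext_coherent)
		return MTYPE_UC;

	/* local HBM region far from partition or remote XGMI GPU
	 * with regular system scope coherence
	 */
	return MTYPE_NC;
}
```

The behaviour is intended to be identical to the hunk as posted; only the order of the last two branches changes.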

              /* system memory accessed by the APU */

As written, this patch means ext_coherent no longer affects gfx9.4.3 APU devices, but it should.

The following (or equivalent) needs to be added just below this hunk:

            if (num_possible_nodes() <= 1)
                mapping_flags |= mtype_local;
            else
-                mapping_flags |= AMDGPU_VM_MTYPE_NC;
+                mapping_flags |= ext_coherent ? AMDGPU_VM_MTYPE_UC : AMDGPU_VM_MTYPE_NC;
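
The same APU system-memory change, sketched as a standalone function for illustration. The helper name pick_sysmem_mtype() and the possible_nodes parameter (standing in for the result of num_possible_nodes()) are hypothetical, as are the enum values.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the AMDGPU_VM_MTYPE_* flags. */
enum mtype { MTYPE_RW, MTYPE_NC, MTYPE_CC, MTYPE_UC };

/* Hypothetical helper modelling "system memory accessed by the APU". */
static enum mtype pick_sysmem_mtype(int possible_nodes, bool ext_coherent,
				    enum mtype mtype_local)
{
	/* Single NUMA node: reuse the local memory type. */
	if (possible_nodes <= 1)
		return mtype_local;

	/* Multiple nodes: ext_coherent selects UC instead of NC. */
	return ext_coherent ? MTYPE_UC : MTYPE_NC;
}
```

On a single-node system ext_coherent then takes effect through mtype_local, provided mtype_local already accounts for it as in the first hunk.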