<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:DengXian;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"\@DengXian";
panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
span.EmailStyle19
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
mso-ligatures:none;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:174075927;
mso-list-template-ids:1622428444;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="blue" vlink="purple" style="word-wrap:break-word">
<div>
<div class="WordSection1">
<p>> I never saw this problem in my testing, probably because I never got my page tables evicted?<o:p></o:p></p>
<p class="MsoNormal">I observed this problem on APUs with the default 512 MB of VRAM when allocating memory aggressively from several applications at once.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I will try to modify the patch per your suggestions. Thanks!<o:p></o:p></p>
<p class="MsoNormal"><br>
Regards,<o:p></o:p></p>
<p class="MsoNormal">Lang<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt">
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> Kuehling, Felix &lt;Felix.Kuehling@amd.com&gt; <br>
<b>Sent:</b> Wednesday, April 10, 2024 8:32 AM<br>
<b>To:</b> Koenig, Christian &lt;Christian.Koenig@amd.com&gt;; Yu, Lang &lt;Lang.Yu@amd.com&gt;; amd-gfx@lists.freedesktop.org<br>
<b>Subject:</b> Re: [PATCH] drm/amdkfd: make sure VM is ready for updating operations<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p><o:p> </o:p></p>
<div>
<p class="MsoNormal">On 2024-04-08 3:55, Christian König wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">On 07.04.24 at 06:52, Lang Yu wrote: <br>
<br>
<o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">When the VM is in the evicting state, amdgpu_vm_update_range returns -EBUSY.
<br>
restore_process_worker then runs into an infinite loop. <br>
<br>
Fixes: 2fdba514ad5a ("drm/amdgpu: Auto-validate DMABuf imports in compute VMs") <o:p>
</o:p></p>
</blockquote>
<p class="MsoNormal"><br>
Mhm, while it would be good to have this case handled as an error, it should never occur in practice, since we should have validated the VM before validating the DMA-bufs.
<br>
<br>
@Felix, isn't that something we have already taken care of? <o:p></o:p></p>
</blockquote>
<p>The problem I saw when I implemented auto-validation was that migrating a BO invalidates its DMABuf attachments, so I need to validate the DMABuf attachments after validating the BOs they attach to. This auto-validation happens in amdgpu_vm_validate, so I needed to do the VM validation after the BO validation. The problem now seems to be that the BO validation happens in the same loop as the page table update, and the page table update fails if the VM is not valid.<o:p></o:p></p>
<p>I never saw this problem in my testing, probably because I never got my page tables evicted?<o:p></o:p></p>
<p>Anyway, I think the solution is to split the BO validation and the page table update into two separate loops in amdgpu_amdkfd_gpuvm_restore_process_bos, with the VM validation in between:<o:p></o:p></p>
<ol start="1" type="1">
<li class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1">
Validate BOs<o:p></o:p></li><li class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1">
Validate VM (and DMABuf attachments)<o:p></o:p></li><li class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1">
Update page tables for the BOs validated above<o:p></o:p></li></ol>
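<p>The three steps above can be sketched as a small userspace C model. This is only an illustration under stated assumptions, not the actual amdgpu code: every toy_* name is hypothetical, the model assumes that a page table update returns -EBUSY while the VM is not valid (as described in this thread), and it assumes that the existing combined loop reaches the VM validation only after the loop has finished.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical), NOT real amdgpu/amdkfd types or functions. */
struct toy_vm {
	bool page_tables_valid; /* false while the VM's page tables are evicted */
};

struct toy_bo {
	bool resident; /* set once the BO has been validated */
	bool mapped;   /* set once the page tables cover the BO */
};

/* Step 1 helper: validating a BO makes it resident again. */
static int toy_validate_bo(struct toy_bo *bo)
{
	assert(bo != NULL);
	bo->resident = true;
	return 0;
}

/* Step 2 helper: validating the VM brings its page tables back. */
static int toy_validate_vm(struct toy_vm *vm)
{
	assert(vm != NULL);
	vm->page_tables_valid = true;
	return 0;
}

/* Step 3 helper: refuses to update page tables of a VM that is not valid. */
static int toy_update_page_tables(struct toy_vm *vm, struct toy_bo *bo)
{
	if (!vm->page_tables_valid)
		return -EBUSY;
	if (!bo->resident)
		return -EINVAL;
	bo->mapped = true;
	return 0;
}

/* Combined loop (assumed current shape): validate and map each BO together,
 * and validate the VM only after the loop. */
static int toy_restore_single_loop(struct toy_vm *vm, struct toy_bo *bos, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int ret = toy_validate_bo(&bos[i]);

		if (!ret)
			ret = toy_update_page_tables(vm, &bos[i]);
		if (ret)
			return ret; /* bails out before the VM is ever validated */
	}
	return toy_validate_vm(vm);
}

/* Split version: validate BOs, validate the VM, then update page tables. */
static int toy_restore_split(struct toy_vm *vm, struct toy_bo *bos, size_t n)
{
	int ret;

	for (size_t i = 0; i < n; i++) {
		ret = toy_validate_bo(&bos[i]);
		if (ret)
			return ret;
	}

	ret = toy_validate_vm(vm);
	if (ret)
		return ret;

	for (size_t i = 0; i < n; i++) {
		ret = toy_update_page_tables(vm, &bos[i]);
		if (ret)
			return ret;
	}
	return 0;
}
```

<p>In this model, a VM whose page tables start out evicted makes toy_restore_single_loop return -EBUSY on every call, because it never gets as far as the VM validation. A caller that simply retries would therefore spin forever. toy_restore_split returns 0 for the same starting state and leaves every BO mapped.</p>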
<p>Regards,<br>
Felix<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"><br>
Regards, <br>
Christian. <br>
<br>
<br>
<br>
<o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"><br>
Signed-off-by: Lang Yu <a href="mailto:Lang.Yu@amd.com">&lt;Lang.Yu@amd.com&gt;</a> <br>
--- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 6 ++++++ <br>
1 file changed, 6 insertions(+) <br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
<br>
index 0ae9fd844623..8c71fe07807a 100644 <br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c <br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c <br>
@@ -2900,6 +2900,12 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
<br>
amdgpu_sync_create(&sync_obj); <br>
+ ret = process_validate_vms(process_info, NULL); <br>
+ if (ret) { <br>
+ pr_debug("Validating VMs failed, ret: %d\n", ret); <br>
+ goto validate_map_fail; <br>
+ } <br>
+ <br>
/* Validate BOs and map them to GPUVM (update VM page tables). */ <br>
list_for_each_entry(mem, &process_info->kfd_bo_list, <br>
validate_list) { <o:p></o:p></p>
</blockquote>
<p class="MsoNormal"><o:p> </o:p></p>
</blockquote>
</div>
</div>
</div>
</body>
</html>