<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <style type="text/css" style="display:none;"></style> </head> <body dir="ltr"> <style type="text/css" style="display:none;"></style> <div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif;" dir="ltr"> we will send another patch to fix it later that using ->hw_init instead of ->resume . BR Monk </div> <hr style="display:inline-block;width:98%" tabindex="-1"> <div id="divRplyFwdMsg" dir="ltr">发件人: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> 代表 Liu, Monk <Monk.Liu@amd.com> 发送时间: 2017年2月8日 23:40:35 收件人: Christian König; amd-gfx@lists.freedesktop.org 主题: 答复: 答复: 答复: SPAM //答复: [PATCH 09/20] drm/amdgpu:implement SRIOV gpu_reset <div> </div> </div> <div> <div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif;" dir="ltr"> If I understand you correct you really don't need all that reload and restore dance. Instead what you need for the SRIOV case is just reinitializing the hardware, isn't it? That this works is just pure coincident because we don't have backup/restore function for the blocks enabled for SRIOV. ML： yeah, I use amdgpu_sriov_resume_early/late because according to the current code sequence there is nothing bad introduced without invoking suspend first. call hw_init on each IP should behaves the same as my patch. I misunderstand you that I thought you insist invoking suspend() and resume() in pair .. I agree use hw_init without suspend is more reasonable thanks <div style="color: rgb(0, 0, 0);"> <hr tabindex="-1" style="display:inline-block; width:98%"> <div id="divRplyFwdMsg" dir="ltr">发件人: Christian König <deathsimple@vodafone.de> 发送时间: 2017年2月8日下午 11:27 收件人: Liu, Monk; amd-gfx@lists.freedesktop.org 主题: Re: 答复: 答复: SPAM //答复: [PATCH 09/20] drm/amdgpu:implement SRIOV gpu_reset <div> </div> </div> <div> <div class="moz-cite-prefix"> <blockquote type="cite"> do you mean turn VM into s3 mode ? if so we not get that step, the S3 suspend/resume function is not supported by SRIOV vf case. </blockquote> Then just try to unbind the fb and unload the module. The idea of the suspend/resume callbacks are that they are only called in pairs. See suspend is supposed to unpin all BOs and backup all resources from VRAM to GTT and preserve the general hardware state. Now what resume does is to reload BOs and restore the state previously saved during the suspend backup. If I understand you correct you really don't need all that reload and restore dance. Instead what you need for the SRIOV case is just reinitializing the hardware, isn't it? That this works is just pure coincident because we don't have backup/restore function for the blocks enabled for SRIOV. Regards, Christian. Am 08.02.2017 um 16:17 schrieb Liu, Monk: </div> <blockquote type="cite"> <div id="divtagdefaultwrapper" style="font-size:12pt; color:#000000; font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr"> wait a minutes ... 》3. suspend the VM with glxgears running do you mean turn VM into s3 mode ? if so we not get that step, the S3 suspend/resume function is not supported by SRIOV vf case. </div> <hr tabindex="-1" style="display:inline-block; width:98%"> <div id="divRplyFwdMsg" dir="ltr">发件人: Christian König <a class="moz-txt-link-rfc2396E" href="mailto:deathsimple@vodafone.de"><deathsimple@vodafone.de></a> 发送时间: 2017年2月8日 23:13:46 收件人: Liu, Monk; <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org"> amd-gfx@lists.freedesktop.org</a> 主题: Re: 答复: SPAM //答复: [PATCH 09/20] drm/amdgpu:implement SRIOV gpu_reset <div> </div> </div> <div> <div class="moz-cite-prefix"> <blockquote type="cite">and like I said, this approach is correct and verified by hang test</blockquote> Completely irrelevant. Please try the following: 1. Trigger a hang 2. Reset the GPU 3. Suspend the VM with glxgears running 4. Resume the VM I'm pretty sure that with your approach that either suspending or resuming the VM with an application running will just hang. Anyway even if all the code path you break here (UVD/VCE at minimum) are disabled in the SRIOV case it's still not a good idea completely breaking the design just to satisfy the SRIOV feature. So that is still a clear NAK on that. Please do as I told you and use the hw_init() callback instead, it is especially made for this use case. Regards, Christian. Am 08.02.2017 um 15:57 schrieb Liu, Monk: </div> <blockquote type="cite"><style style="display:none" type="text/css">  </style> <div id="divtagdefaultwrapper" style="font-size:12pt; color:#000000; font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr"> 》As I wrote in the other thread as well calling amdgpu_resume() without proper suspend will just mess up a whole bunch of internal structures. <div> </div> <div>please name at least one, I'll check and see how to improve</div> <div> </div> <div>and like I said, this approach is correct and verified by hang test, if internal structures messed up I don't think the test will easy pass. hw_init() just call resume per engine.</div> <div> </div> <div>you can take a deep look into sriov_gpu_reset and judge later </div> <div> </div> </div> <hr tabindex="-1" style="display:inline-block; width:98%"> <div id="divRplyFwdMsg" dir="ltr">发件人: Christian König <a class="moz-txt-link-rfc2396E" href="mailto:deathsimple@vodafone.de"><deathsimple@vodafone.de></a> 发送时间: 2017年2月8日 18:49:57 收件人: Liu, Monk; <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org"> amd-gfx@lists.freedesktop.org</a> 主题: Re: SPAM //答复: [PATCH 09/20] drm/amdgpu:implement SRIOV gpu_reset <div> </div> </div> <div> <div class="moz-cite-prefix"> <blockquote type="cite"> <div class="PlainText">+ /* now we are okay to resume SMC/CP/SDMA */ + amdgpu_resume_late(adev); </div> </blockquote> As I wrote in the other thread as well calling amdgpu_resume() without proper suspend will just mess up a whole bunch of internal structures. So a clear NAK on that approach. If you don't need the hw stop which amdgpu_suspend() does for SRIOV then please try to just use the hw_init() callback and not the resume() callback. Regards, Christian. Am 07.02.2017 um 07:26 schrieb Liu, Monk: </div> <blockquote type="cite"> <meta name="Generator" content="Microsoft Exchange Server"> <style>  </style> <meta content="text/html; charset=UTF-8"> <style type="text/css">  </style> <div dir="ltr"> <div id="x_divtagdefaultwrapper" style="font-size:12pt; color:#000000; font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr"> patch 1-8 are some fixes for sriov gpu reset feature patch 9 -20 are for sriov gpu reset BR Monk </div> <hr tabindex="-1" style="display:inline-block; width:98%"> <div id="x_divRplyFwdMsg" dir="ltr">发件人: amd-gfx <a class="moz-txt-link-rfc2396E" href="mailto:amd-gfx-bounces@lists.freedesktop.org"> <amd-gfx-bounces@lists.freedesktop.org></a> 代表 Monk Liu <a class="moz-txt-link-rfc2396E" href="mailto:Monk.Liu@amd.com"> <Monk.Liu@amd.com></a> 发送时间: 2017年2月7日 14:11:07 收件人: <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org"> amd-gfx@lists.freedesktop.org</a> 抄送: Liu, Monk 主题: [PATCH 09/20] drm/amdgpu:implement SRIOV gpu_reset <div> </div> </div> </div> <div class="PlainText">Signed-off-by: Monk Liu <a class="moz-txt-link-rfc2396E" href="mailto:Monk.Liu@amd.com"> <Monk.Liu@amd.com></a> --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 158 ++++++++++++++++++++++++++++- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 1 + 2 files changed, 158 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index e926f84..2b404ca 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -1604,6 +1604,53 @@ int amdgpu_suspend(struct amdgpu_device *adev) return 0; } +static int amdgpu_resume_early(struct amdgpu_device *adev) +{ + int i, r; + + for (i = 0; i < adev->num_ip_blocks; i++) { + if (!adev->ip_blocks[i].status.valid) + continue; + + if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_COMMON || + adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC || + adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_IH) + r = adev->ip_blocks[i].version->funcs->resume(adev); + + if (r) { + DRM_ERROR("resume of IP block <%s> failed %d\n", + adev->ip_blocks[i].version->funcs->name, r); + return r; + } + } + + return 0; +} + +static int amdgpu_resume_late(struct amdgpu_device *adev) +{ + int i, r; + + for (i = 0; i < adev->num_ip_blocks; i++) { + if (!adev->ip_blocks[i].status.valid) + continue; + + if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_COMMON || + adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC || + adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_IH ) + continue; + + r = adev->ip_blocks[i].version->funcs->resume(adev); + if (r) { + DRM_ERROR("resume of IP block <%s> failed %d\n", + adev->ip_blocks[i].version->funcs->name, r); + return r; + } + } + + return 0; +} + static int amdgpu_resume(struct amdgpu_device *adev) { int i, r; @@ -2343,6 +2390,115 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev, } /** + * amdgpu_sriov_gpu_reset - reset the asic + * + * @adev: amdgpu device pointer + * @voluntary: if this reset is requested by guest. + * (true means by guest and false means by HYPERVISOR ) + * + * Attempt the reset the GPU if it has hung (all asics). + * for SRIOV case. + * Returns 0 for success or an error on failure. + */ +int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, bool voluntary) +{ + int i, r = 0; + int resched; + struct amdgpu_bo *bo, *tmp; + struct amdgpu_ring *ring; + struct fence *fence = NULL, *next = NULL; + + mutex_lock(&adev->virt.lock_reset); + atomic_inc(&adev->gpu_reset_counter); + + /* block TTM */ + resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev); + + /* block scheduler */ + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { + ring = adev->rings[i]; + + if (!ring || !ring->sched.thread) + continue; + + kthread_park(ring->sched.thread); + amd_sched_hw_job_reset(&ring->sched); + } + + /* after all hw jobs are reset, hw fence is meaningless, so force_completion */ + amdgpu_fence_driver_force_completion(adev); + + /* request to take full control of GPU before re-initialization */ + if (voluntary) + amdgpu_virt_reset_gpu(adev); + else + amdgpu_virt_request_full_gpu(adev, true); + + + /* Resume IP prior to SMC */ + amdgpu_resume_early(adev); + + /* we need recover gart prior to run SMC/CP/SDMA resume */ + amdgpu_ttm_recover_gart(adev); + + /* now we are okay to resume SMC/CP/SDMA */ + amdgpu_resume_late(adev); </div> </blockquote> <blockquote type="cite"> <div class="PlainText">+ + amdgpu_irq_gpu_reset_resume_helper(adev); + + if (amdgpu_ib_ring_tests(adev)) + dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r); + + /* rellease full control of GPU after ib test */ + amdgpu_virt_release_full_gpu(adev, true); + + DRM_INFO("recover vram bo from shadow\n"); + + ring = adev->mman.buffer_funcs_ring; + mutex_lock(&adev->shadow_list_lock); + list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) { + amdgpu_recover_vram_from_shadow(adev, ring, bo, &next); + if (fence) { + r = fence_wait(fence, false); + if (r) { + WARN(r, "recovery from shadow isn't completed\n"); + break; + } + } + + fence_put(fence); + fence = next; + } + mutex_unlock(&adev->shadow_list_lock); + + if (fence) { + r = fence_wait(fence, false); + if (r) + WARN(r, "recovery from shadow isn't completed\n"); + } + fence_put(fence); + + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { + struct amdgpu_ring *ring = adev->rings[i]; + if (!ring || !ring->sched.thread) + continue; + + amd_sched_job_recovery(&ring->sched); + kthread_unpark(ring->sched.thread); + } + + drm_helper_resume_force_mode(adev->ddev); + ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched); + if (r) { + /* bad news, how to tell it to userspace ? */ + dev_info(adev->dev, "GPU reset failed\n"); + } + + mutex_unlock(&adev->virt.lock_reset); + return r; +} + +/** * amdgpu_gpu_reset - reset the asic * * @adev: amdgpu device pointer @@ -2358,7 +2514,7 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev) bool need_full_reset; if (amdgpu_sriov_vf(adev)) - return 0; + return amdgpu_sriov_gpu_reset(adev, true); if (!amdgpu_check_soft_reset(adev)) { DRM_INFO("No hardware hang detected. Did some blocks stall?\n"); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h index 675e12c..73d24df 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h @@ -89,5 +89,6 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v); int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init); int amdgpu_virt_release_full_gpu(struct amdgpu_device *adev, bool init); int amdgpu_virt_reset_gpu(struct amdgpu_device *adev); +int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, bool voluntary); #endif -- 2.7.4 _______________________________________________ amd-gfx mailing list <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a> <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a> </div> <fieldset class="mimeAttachmentHeader"></fieldset> <pre>_______________________________________________ amd-gfx mailing list <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a> <a class="moz-txt-link-freetext" href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a> </pre> </blockquote> </div> </blockquote> </div> <fieldset class="mimeAttachmentHeader"></fieldset> <pre>_______________________________________________ amd-gfx mailing list <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a> <a class="moz-txt-link-freetext" href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a> </pre> </blockquote> </div> </div> </div> </div> </body> </html>