<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p><br>
</p>
<div class="moz-cite-prefix">On 9/2/20 5:56 PM, Bjorn Helgaas wrote:<br>
</div>
<blockquote type="cite" cite="mid:20200902215612.GA271679@bjorn-Precision-5520">
<pre class="moz-quote-pre" wrap="">On Wed, Sep 02, 2020 at 02:42:03PM -0400, Andrey Grodzovsky wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">At this point the ASIC is already post reset by the HW/PSP
so the HW not in proper state to be configured for suspension,
some blocks might be even gated and so best is to avoid touching it.
v2: Rename in_dpc to more meaningful name
Signed-off-by: Andrey Grodzovsky <a class="moz-txt-link-rfc2396E" href="mailto:andrey.grodzovsky@amd.com"><andrey.grodzovsky@amd.com></a>
Reviewed-by: Alex Deucher <a class="moz-txt-link-rfc2396E" href="mailto:alexander.deucher@amd.com"><alexander.deucher@amd.com></a>
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 6 +++++
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 6 +++++
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 18 ++++++++------
drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 3 +++
6 files changed, 65 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c311a3c..b20354f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,6 +992,7 @@ struct amdgpu_device {
atomic_t throttling_logging_enabled;
struct ratelimit_state throttling_logging_rs;
uint32_t ras_features;
+ bool in_pci_err_recovery;
};
static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 74a1c03..1fbf8a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
{
uint32_t ret;
+ if (adev->in_pci_err_recovery)
+ return 0;
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I don't know the whole scheme of this, but this looks racy.
It looks like the normal path through this function is the readl(),
which I assume is an MMIO read from the PCI device. If this is called
after a PCI error occurs, but before amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, the readl() will return 0xffffffff.
If this is called after amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, it will return 0. Do you really want those
two different cases?</pre>
</blockquote>
<p><br>
</p>
<p>This is not intended to to avoid hardware accessed by other
threads when doing PCI recovery (answers also Denis's question) -
<br>
in slot reset callback we suspend and then resume the IP blocks so
we can bring the SW and HW back to operational. For this<br>
we first call IPs suspend and then resume. But the ASIC was
already reset by the time we execute this code and so there<br>
is a misalignment between the HW and the SW states. The HW is in
clean '<span style="color: rgb(0, 0, 0); font-family: serif;
font-size: 16px; font-style: normal; font-variant-ligatures:
normal; font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; orphans: 2; text-align: start;
text-indent: 0px; text-transform: none; white-space: normal;
widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(252, 252, 252); text-decoration-style:
initial; text-decoration-color: initial; display: inline
!important; float: none;">fresh poweron</span> or init' state
while the SW is in running state.<br>
So I want to align SW state with the HW state by calling suspend
IPs sequence without touching the HW and so this flag is used to<br>
skip all HW registers r/w while I do it. Regarding other threads
which might access the registers this should not happen as I
sopped all new internal<br>
and external command submissions and took GPU reset device lock
so at least we are protected here same as during ordinary GPU
reset.<br>
Yes, it's not 100% proof and there still might be some accesses
from other threads during this time and they will return
0xffffffff values. <br>
</p>
<p>Andrey</p>
<p><br>
</p>
<blockquote type="cite" cite="mid:20200902215612.GA271679@bjorn-Precision-5520">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
return amdgpu_kiq_rreg(adev, reg);
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">@@ -4773,7 +4809,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
pci_restore_state(pdev);
+ adev->in_pci_err_recovery = true;
r = amdgpu_device_ip_suspend(adev);
+ adev->in_pci_err_recovery = false;
if (r)
goto out;
</pre>
</blockquote>
</blockquote>
</body>
</html>