[PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Thu Sep 3 15:25:02 UTC 2020


On 9/2/20 5:56 PM, Bjorn Helgaas wrote:
> On Wed, Sep 02, 2020 at 02:42:03PM -0400, Andrey Grodzovsky wrote:
>> At this point the ASIC is already post reset by the HW/PSP
>> so the HW not in proper state to be configured for suspension,
>> some blocks might be even gated and so best is to avoid touching it.
>>
>> v2: Rename in_dpc to more meaningful name
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> Reviewed-by: Alex Deucher <alexander.deucher at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
>>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 ++++++++------
>>   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
>>   6 files changed, 65 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index c311a3c..b20354f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -992,6 +992,7 @@ struct amdgpu_device {
>>   	atomic_t			throttling_logging_enabled;
>>   	struct ratelimit_state		throttling_logging_rs;
>>   	uint32_t			ras_features;
>> +	bool                            in_pci_err_recovery;
>>   };
>>   
>>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 74a1c03..1fbf8a1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>>   {
>>   	uint32_t ret;
>>   
>> +	if (adev->in_pci_err_recovery)
>> +		return 0;
> I don't know the whole scheme of this, but this looks racy.
>
> It looks like the normal path through this function is the readl(),
> which I assume is an MMIO read from the PCI device.  If this is called
> after a PCI error occurs, but before amdgpu_pci_slot_reset() sets
> adev->in_pci_err_recovery, the readl() will return 0xffffffff.
>
> If this is called after amdgpu_pci_slot_reset() sets
> adev->in_pci_err_recovery, it will return 0.  Do you really want those
> two different cases?


This is not intended to to avoid hardware accessed by other threads when doing 
PCI recovery (answers also Denis's question) -
in slot reset callback we suspend and then resume the IP blocks so we can bring 
the SW and HW back to operational. For this
we first call IPs suspend and then resume. But the ASIC was already reset by the 
time we execute this code and so there
is a misalignment between the HW and the SW states. The HW is in clean 'fresh 
poweron or init' state while the SW is in running state.
So I want to align SW state with the HW state by calling suspend IPs sequence 
without touching the HW and so this flag is used to
skip all HW registers r/w while I do it. Regarding other threads which might 
access the registers this should not happen as I sopped all new internal
and external command submissions  and took GPU reset device lock so at least we 
are protected here same as during ordinary GPU reset.
Yes, it's not 100% proof and there still might be some accesses from other 
threads during this time and they will return 0xffffffff values.

Andrey


>
>>   	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>>   		return amdgpu_kiq_rreg(adev, reg);
>> @@ -4773,7 +4809,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>   
>>   	pci_restore_state(pdev);
>>   
>> +	adev->in_pci_err_recovery = true;
>>   	r = amdgpu_device_ip_suspend(adev);
>> +	adev->in_pci_err_recovery = false;
>>   	if (r)
>>   		goto out;
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20200903/13db963e/attachment.htm>


More information about the amd-gfx mailing list