[PATCH] drm/amdgpu: drop kiq access while in reset
Lazar, Lijo
lijo.lazar at amd.com
Mon Jun 24 12:24:32 UTC 2024
On 6/24/2024 5:31 PM, Christian König wrote:
> Am 24.06.24 um 13:57 schrieb Lazar, Lijo:
>> On 6/24/2024 5:19 PM, Christian König wrote:
>>> Am 24.06.24 um 11:52 schrieb Lazar, Lijo:
>>>> On 6/24/2024 3:08 PM, Christian König wrote:
>>>>> Am 24.06.24 um 08:34 schrieb Lazar, Lijo:
>>>>>> On 6/24/2024 12:01 PM, Vignesh Chander wrote:
>>>>>>> correctly handle the case when trylock fails when gpu is
>>>>>>> about to be reset by dropping the request instead of using mmio
>>>>>>>
>>>>>>> Signed-off-by: Vignesh Chander <Vignesh.Chander at amd.com>
>>>>>> Reviewed-by: Lijo Lazar <lijo.lazar at amd.com>
>>>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38
>>>>>>> ++++++++++++----------
>>>>>>> 1 file changed, 21 insertions(+), 17 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index a7ce8280b17ce7..3cfd24699d691d 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -613,10 +613,11 @@ uint32_t amdgpu_device_rreg(struct
>>>>>>> amdgpu_device *adev,
>>>>>>> if ((reg * 4) < adev->rmmio_size) {
>>>>>>> if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>>>>>> - amdgpu_sriov_runtime(adev) &&
>>>>>>> - down_read_trylock(&adev->reset_domain->sem)) {
>>>>>>> - ret = amdgpu_kiq_rreg(adev, reg, 0);
>>>>>>> - up_read(&adev->reset_domain->sem);
>>>>>>> + amdgpu_sriov_runtime(adev) {
>>>>>>> + if (down_read_trylock(&adev->reset_domain->sem)) {
>>>>>>> + ret = amdgpu_kiq_rreg(adev, reg, 0);
>>>>>>> + up_read(&adev->reset_domain->sem);
>>>>>>> + }
>>>>> What has actually changed here? As far as I can see that isn't
>>>>> functionally different to the existing code.
>>>>>
>>>> In earlier logic, if it fails to get the lock, it takes the 'else'
>>>> path.
>>>> The 'else' path is not meant for sriov/vf.
>>> Yeah, but the read or write is then just silently dropped.
>>>
>>> That can't be correct either.
>>>
>> These are void funcs. Moreover, the drops will happen if there is
>> concurrent access from another thread while a reset is going on. That is
>> expected and those accesses during a reset won't help anyway.
>
> Nope, Teddy has been working on that for a while as well.
This silent drop is already there in bare metal.
https://elixir.bootlin.com/linux/v6.10-rc5/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L738
>
> Trying to make those accesses while the reset is going on is wrong in
> the first place. What we need to do is to grab the reset lock in the
> higher level function so that the whole sequences of reads and writes
> are protected.
>
> What this logic here does is to use readl()/writel() from the reset
> thread itself. Dropping that is incorrect and could lead to broken reset.
This doesn't change anything for a reset thread. This fixes an already
broken path for sriov where it attempts a direct readl()/writel() if
taking the lock fails. That is even more broken.
Thanks,
Lijo
>
> So clear NAK from my side to this patch here.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>> Lijo
>>
>>> Regards,
>>> Christian.
>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>> } else {
>>>>>>> ret = readl(((void __iomem *)adev->rmmio) + (reg
>>>>>>> * 4));
>>>>>>> }
>>>>>>> @@ -681,10 +682,11 @@ uint32_t amdgpu_device_xcc_rreg(struct
>>>>>>> amdgpu_device *adev,
>>>>>>> &rlcg_flag)) {
>>>>>>> ret = amdgpu_virt_rlcg_reg_rw(adev, reg, 0,
>>>>>>> rlcg_flag,
>>>>>>> GET_INST(GC, xcc_id));
>>>>>>> } else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>>>>>> - amdgpu_sriov_runtime(adev) &&
>>>>>>> - down_read_trylock(&adev->reset_domain->sem)) {
>>>>>>> - ret = amdgpu_kiq_rreg(adev, reg, xcc_id);
>>>>>>> - up_read(&adev->reset_domain->sem);
>>>>>>> + amdgpu_sriov_runtime(adev) {
>>>>>>> + if (down_read_trylock(&adev->reset_domain->sem)) {
>>>>>>> + ret = amdgpu_kiq_rreg(adev, reg, xcc_id);
>>>>>>> + up_read(&adev->reset_domain->sem);
>>>>>>> + }
>>>>>>> } else {
>>>>>>> ret = readl(((void __iomem *)adev->rmmio) + (reg
>>>>>>> * 4));
>>>>>>> }
>>>>>>> @@ -740,10 +742,11 @@ void amdgpu_device_wreg(struct amdgpu_device
>>>>>>> *adev,
>>>>>>> if ((reg * 4) < adev->rmmio_size) {
>>>>>>> if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>>>>>> - amdgpu_sriov_runtime(adev) &&
>>>>>>> - down_read_trylock(&adev->reset_domain->sem)) {
>>>>>>> - amdgpu_kiq_wreg(adev, reg, v, 0);
>>>>>>> - up_read(&adev->reset_domain->sem);
>>>>>>> + amdgpu_sriov_runtime(adev) {
>>>>>>> + if (down_read_trylock(&adev->reset_domain->sem)) {
>>>>>>> + amdgpu_kiq_wreg(adev, reg, v, 0);
>>>>>>> + up_read(&adev->reset_domain->sem);
>>>>>>> + }
>>>>>>> } else {
>>>>>>> writel(v, ((void __iomem *)adev->rmmio) + (reg *
>>>>>>> 4));
>>>>>>> }
>>>>>>> @@ -812,11 +815,12 @@ void amdgpu_device_xcc_wreg(struct
>>>>>>> amdgpu_device *adev,
>>>>>>> &rlcg_flag)) {
>>>>>>> amdgpu_virt_rlcg_reg_rw(adev, reg, v, rlcg_flag,
>>>>>>> GET_INST(GC, xcc_id));
>>>>>>> } else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>>>>>> - amdgpu_sriov_runtime(adev) &&
>>>>>>> - down_read_trylock(&adev->reset_domain->sem)) {
>>>>>>> - amdgpu_kiq_wreg(adev, reg, v, xcc_id);
>>>>>>> - up_read(&adev->reset_domain->sem);
>>>>>>> - } else {
>>>>>>> + amdgpu_sriov_runtime(adev) {
>>>>>>> + if (down_read_trylock(&adev->reset_domain->sem)) {
>>>>>>> + amdgpu_kiq_wreg(adev, reg, v, xcc_id);
>>>>>>> + up_read(&adev->reset_domain->sem);
>>>>>>> + }
>>>>>>> + } else {
>>>>>>> writel(v, ((void __iomem *)adev->rmmio) + (reg *
>>>>>>> 4));
>>>>>>> }
>>>>>>> } else {
>
More information about the amd-gfx
mailing list