[PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)

Mon Mar 5 07:57:29 UTC 2018

> otherwise I don't see how it is better by reverting it
Well it's better to revert it for now because it seems to create more 
problems than it solves.

To better approach this issue I suggest to do the following:
1. Revert the original patch.

2. Stop waiting to long for writes. E.g. use a separate timeout (20ms 
maybe?) to wait for the write. Then do a WARN_ON_ONCE when we timeout.

3. To the read function add a "if (!in_intterupt()) may_sleep();" and 
then retest. That should at least print a nice warning when called from 
atomic context.

4. Test the whole thing and try to fix all warnings about atomic 
contexts from the may_sleep();

5. Reapply the original patch, but this time only for the read function, 
not the write function.

Regards,
Christian.

Am 05.03.2018 um 05:20 schrieb Liu, Monk:
> When there are 16 VF/VM on one GPU, we can easily hit "sys lockup to 22s" kernel error/warning introduced by kiq_rreg/wreg routine
> That's why I must use this patch to let thread sleep a while and try again,
>
> If you insist reverting this patch please give me a solution, otherwise I don't see how it is better by reverting it
>
> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
> Sent: 2018年3月3日 21:38
> To: Kuehling, Felix <Felix.Kuehling at amd.com>; Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
>
> Am 02.03.2018 um 21:47 schrieb Felix Kuehling:
>> On 2018-03-02 04:29 AM, Liu, Monk wrote:
>>> In_atomic() isnot encouraged to be used to judge if sleep is
>>> possible, see the macros of it
>>>
>>> #define in_atomic() (preept_count() != 0)
>> OK. But my point is still that you're not testing the right thing when
>> you check in_interrupt(). The comment before the in_atomic macro
>> definition states the limitations and says "do not use in driver code".
>> Unfortunately it doesn't suggest any alternative. I think in_interrupt
>> is actually worse, because it misses even more cases than in_atomic.
> Thinking about this, Felix seems to be absolutely right.
>
> So we need to revert this patch since you can't reliable detect in a driver if sleeping is allowed or not.
>
> Regards,
> Christian.
>
>> Regards,
>>     Felix
>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Kuehling, Felix
>>> Sent: 2018年3月1日 23:50
>>> To: amd-gfx at lists.freedesktop.org; Liu, Monk <Monk.Liu at amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
>>> IRQ(v2)
>>>
>>> On 2018-02-28 02:27 AM, Monk Liu wrote:
>>>> sometimes GPU is switched to other VFs and won't swich back soon, so
>>>> the kiq reg access will not signal within a short period, instead of
>>>> busy waiting a long time(MAX_KEQ_REG_WAIT) and returning TMO we can
>>>> istead sleep 5ms and try again later (non irq context)
>>>>
>>>> And since the waiting in kiq_r/weg is busy wait, so MAX_KIQ_REG_WAIT
>>>> shouldn't set to a long time, set it to 10ms is more appropriate.
>>>>
>>>> if gpu already in reset state, don't retry the KIQ reg access
>>>> otherwise it would always hang because KIQ was already die usually.
>>>>
>>>> v2:
>>>> replace schedule() with msleep() for the wait
>>>>
>>>> Change-Id: I8fc807ce85a8d30d2b50153f3f3a6eda344ef994
>>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 15 +++++++++++++--
>>>>    1 file changed, 13 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> index b832651..1672f5b 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> @@ -22,7 +22,7 @@
>>>>     */
>>>>    
>>>>    #include "amdgpu.h"
>>>> -#define MAX_KIQ_REG_WAIT	100000000 /* in usecs */
>>>> +#define MAX_KIQ_REG_WAIT	10000 /* in usecs, 10ms */
>>>>    
>>>>    uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)  { @@ -152,9
>>>> +152,14 @@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev,
>>>> +uint32_t reg)
>>>>    	amdgpu_ring_commit(ring);
>>>>    	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>    
>>>> +retry_read:
>>>>    	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>>    	if (r < 1) {
>>>>    		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>> You should check in_atomic here. Because it's invalid to sleep in atomic context (e.g. while holding a spin lock) even when not in an interrupt.
>>> This seems to happen a lot for indirect register access, e.g.
>>> soc15_pcie_rreg.
>>>
>>> Regards,
>>>     Felix
>>>
>>>> +			msleep(5);
>>>> +			goto retry_read;
>>>> +		}
>>>>    		return ~0;
>>>>    	}
>>>>    	val = adev->wb.wb[adev->virt.reg_val_offs];
>>>> @@ -179,9 +184,15 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>>>>    	amdgpu_ring_commit(ring);
>>>>    	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>    
>>>> +retry_write:
>>>>    	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>> -	if (r < 1)
>>>> +	if (r < 1) {
>>>>    		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>>> +			msleep(5);
>>>> +			goto retry_write;
>>>> +		}
>>>> +	}
>>>>    }
>>>>    
>>>>    /**
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx