[PATCH] drm/amdgpu: capture invalid hardware access v2
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Tue Mar 9 19:13:08 UTC 2021
On 2021-03-09 1:51 p.m., Christian König wrote:
> Am 09.03.21 um 19:26 schrieb Andrey Grodzovsky:
>> On 2021-03-09 12:47 p.m., Christian König wrote:
>>> No it won't. Accessing the hardware without the lock is ok as long as
>>> the write side isn't taken.
>>
>> Oh, forgot about the trylock part, sorry...
>>
>>>
>>> But that approach is illegal anyway because we suspend the hardware
>>> without proper protection from concurrent access.
>>
>> For my understanding, looking again at his steps related to this:
>>
>> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) - [AG] protects from
>> other TDR threads
>>
>> Step 1: cancel all delayed work, stop the drm scheduler, complete all
>> unreceived fences and so on. Call amdgpu_device_pre_asic_reset... etc.
>> - [AG] this is the HW suspend part
>>
>> Step 2: call down_write(&adev->reset_sem) to hold the write lock, which
>> will block the recovery thread until other threads release their read locks.
>>
>> Is the problem here that the HW is suspended while some other threads
>> that rely on the read side of the lock still access the HW?
>
> Exactly that, yes.
>
>> Mostly what I am thinking about are IOCTLs - we can't 'wait for them to
>> complete', but they might be accessing the HW when we start the suspend.
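>>
>> To make this concrete, the problematic interleaving would look roughly
>> like this (a simplified sketch of the steps above, not the actual
>> driver code):
>>
>>     reset thread                          IOCTL thread
>>     ------------                          ------------
>>     atomic_cmpxchg(&adev->in_gpu_reset,
>>                    0, 1)
>>     /* step 1: suspend HW */              WREG32(reg, v);  /* pokes HW
>>     amdgpu_device_pre_asic_reset()           that is being suspended */
>>     /* step 2: only now */
>>     down_write(&adev->reset_sem)
>>
>> Nothing stops the IOCTL thread between steps 1 and 2, because the write
>> side of the lock is only taken after the HW is already suspended.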
>
> Well, the bigger problem with the IOCTLs is that I don't seem to be able
> to properly explain the dependency here.
>
> Maybe let me describe it from the other side once more to make it more
> understandable:
>
> Core Linux memory management depends on waiting for dma_fences. Amdgpu's
> dma_fences in turn depend on the GPU scheduler and the recovery/reset
> functionality to work properly.
>
> So every call to kmalloc() or get_free_page() depends on the amdgpu GPU
> recover/reset function, and with it on every lock taken in that function.
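>
> Spelled out, the dependency chain is:
>
>     kmalloc() / get_free_page()
>         -> memory reclaim, which may wait on dma_fences
>         -> amdgpu's dma_fences, which need the GPU scheduler
>         -> GPU recover/reset, and so every lock taken in it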
>
> Because of this you can't take any lock under which memory allocation
> might happen while holding the reset lock.
>
> This is a hard and by now very well documented limitation, and Dennis'
> changes here don't magically remove it.
>
> Because of this, taking the reset lock or any other lock used in the GPU
> reset functionality at the beginning of our IOCTLs will never work
> correctly.
>
> What you need to do is to have all locks which depend on MM taken before
> you take the reset lock, e.g. dma_resv, mmap_sem, VM locks etc. You also
> must not call functions like kmalloc(), get_free_page(), dma_fence_wait()
> etc. while holding it.
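>
> As an illustration, a minimal sketch of that ordering (hypothetical
> helper, names invented for the example; not taken from the patch):
>
>     int amdgpu_foo_locked_hw_access(struct amdgpu_device *adev,
>                                     struct amdgpu_bo *bo)
>     {
>             int r;
>
>             /* First take all locks under which allocations may happen. */
>             r = dma_resv_lock(bo->tbo.base.resv, NULL);
>             if (r)
>                     return r;
>
>             /* Only then take the reset lock, read side. */
>             down_read(&adev->reset_sem);
>
>             /*
>              * From here on: no kmalloc(), get_free_page(),
>              * dma_fence_wait() or anything else that depends on the MM.
>              */
>
>             up_read(&adev->reset_sem);
>             dma_resv_unlock(bo->tbo.base.resv);
>             return 0;
>     }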
>
> I hope that somehow makes it clearer now. We should probably try to
> merge Daniel's patch set for teaching lockdep to complain about such
> things sooner or later.
It is now clearer than in the previous discussion. One thing that would
also be helpful is to see the kernel documentation about this limitation -
can you maybe point me to it in the kernel docs or even in the code?
Andrey
>
> With those patches merged, lockdep starts to complain about the problems
> as soon as you try to run a kernel with them enabled.
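>
> Roughly, the annotations mark the fence signalling critical section (a
> simplified sketch of how the dma_fence annotation API is used):
>
>     bool cookie = dma_fence_begin_signalling();
>     /* fence signalling critical section: GPU reset lives here */
>     /* lockdep now complains about kmalloc(GFP_KERNEL),
>      * dma_fence_wait() and similar dependencies */
>     dma_fence_end_signalling(cookie);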
>
> Regards,
> Christian.
>
>>
>> Andrey
>>
>>
>>>
>>> Christian.
>>>
>>> Am 09.03.21 um 17:40 schrieb Andrey Grodzovsky:
>>>> Because he takes the write side lock post amdgpu_pre_asic_reset -
>>>> where HW suspend sequence happens (touching registers) - so i think
>>>> it will assert.
>>>>
>>>> Andrey
>>>>
>>>> On 2021-03-09 7:56 a.m., Christian König wrote:
>>>>> Hi Dennis,
>>>>>
>>>>> why do you think that this will always assert in the reset thread?
>>>>>
>>>>> In the reset thread, while we are holding the write side of the reset
>>>>> lock, lockdep_assert_held() should be satisfied and not cause any
>>>>> splat in the system log.
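>>>>>
>>>>> Concretely, a sketch of the reset path:
>>>>>
>>>>>     down_write(&adev->reset_sem);  /* reset thread, write side */
>>>>>     ...
>>>>>     RREG32(reg);  /* -> amdgpu_device_skip_hw_access() */
>>>>>     /* down_read_trylock() fails here, but lockdep_assert_held()
>>>>>      * is satisfied because the write side counts as held,
>>>>>      * so no splat. */
>>>>>     ...
>>>>>     up_write(&adev->reset_sem);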
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 09.03.21 um 03:03 schrieb Li, Dennis:
>>>>>> [AMD Official Use Only - Internal Distribution Only]
>>>>>>
>>>>>> Hi, Christian,
>>>>>> amdgpu_device_skip_hw_access() will always assert in the reset
>>>>>> thread, which does not seem like a good idea.
>>>>>>
>>>>>> Best Regards
>>>>>> Dennis Li
>>>>>> -----Original Message-----
>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>>> Sent: Tuesday, March 9, 2021 2:07 AM
>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>> Cc: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Li, Dennis
>>>>>> <Dennis.Li at amd.com>
>>>>>> Subject: [PATCH] drm/amdgpu: capture invalid hardware access v2
>>>>>>
>>>>>> From: Dennis Li <Dennis.Li at amd.com>
>>>>>>
>>>>>> When the recovery thread has begun a GPU reset, no other threads
>>>>>> should access the hardware, otherwise the system may randomly hang.
>>>>>>
>>>>>> v2 (chk): rewritten from scratch, use trylock and lockdep instead of
>>>>>> hand wiring the logic.
>>>>>>
>>>>>> Signed-off-by: Dennis Li <Dennis.Li at amd.com>
>>>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 74 +++++++++++++++++-----
>>>>>>  1 file changed, 57 insertions(+), 17 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index e247c3a2ec08..c990af6a43ca 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -326,6 +326,34 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>>>>>>  /*
>>>>>>   * register access helper functions.
>>>>>>   */
>>>>>> +
>>>>>> +/* Check if hw access should be skipped because of hotplug or device error */
>>>>>> +static bool amdgpu_device_skip_hw_access(struct amdgpu_device *adev)
>>>>>> +{
>>>>>> +        if (adev->in_pci_err_recovery)
>>>>>> +                return true;
>>>>>> +
>>>>>> +#ifdef CONFIG_LOCKDEP
>>>>>> +        /*
>>>>>> +         * This is a bit complicated to understand, so worth a comment.
>>>>>> +         * What we assert here is that the GPU reset is not running on
>>>>>> +         * another thread in parallel.
>>>>>> +         *
>>>>>> +         * For this we trylock the read side of the reset semaphore; if
>>>>>> +         * that succeeds we know that the reset is not running in parallel.
>>>>>> +         *
>>>>>> +         * If the trylock fails we assert that we are either already
>>>>>> +         * holding the read side of the lock or are the reset thread
>>>>>> +         * itself and hold the write side of the lock.
>>>>>> +         */
>>>>>> +        if (down_read_trylock(&adev->reset_sem))
>>>>>> +                up_read(&adev->reset_sem);
>>>>>> +        else
>>>>>> +                lockdep_assert_held(&adev->reset_sem);
>>>>>> +#endif
>>>>>> +
>>>>>> +        return false;
>>>>>> +}
>>>>>> +
>>>>>>  /**
>>>>>>   * amdgpu_device_rreg - read a memory mapped IO or indirect register
>>>>>>   *
>>>>>> @@ -340,7 +368,7 @@ uint32_t amdgpu_device_rreg(struct amdgpu_device *adev,
>>>>>>  {
>>>>>>          uint32_t ret;
>>>>>>
>>>>>> -        if (adev->in_pci_err_recovery)
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>>                  return 0;
>>>>>>
>>>>>>          if ((reg * 4) < adev->rmmio_size) {
>>>>>> @@ -377,7 +405,7 @@ uint32_t amdgpu_device_rreg(struct amdgpu_device *adev,
>>>>>>   */
>>>>>>  uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>>>>>>  {
>>>>>> -        if (adev->in_pci_err_recovery)
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>>                  return 0;
>>>>>>
>>>>>>          if (offset < adev->rmmio_size)
>>>>>> @@ -402,7 +430,7 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
>>>>>>   */
>>>>>>  void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
>>>>>>  {
>>>>>> -        if (adev->in_pci_err_recovery)
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>>                  return;
>>>>>>
>>>>>>          if (offset < adev->rmmio_size)
>>>>>> @@ -425,7 +453,7 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
>>>>>>                          uint32_t reg, uint32_t v,
>>>>>>                          uint32_t acc_flags)
>>>>>>  {
>>>>>> -        if (adev->in_pci_err_recovery)
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>>                  return;
>>>>>>
>>>>>>          if ((reg * 4) < adev->rmmio_size) {
>>>>>> @@ -452,7 +480,7 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
>>>>>>  void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
>>>>>>                               uint32_t reg, uint32_t v)
>>>>>>  {
>>>>>> -        if (adev->in_pci_err_recovery)
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>>                  return;
>>>>>>
>>>>>>          if (amdgpu_sriov_fullaccess(adev) &&
>>>>>> @@ -475,7 +503,7 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
>>>>>>   */
>>>>>>  u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>>>>>>  {
>>>>>> -        if (adev->in_pci_err_recovery)
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>>                  return 0;
>>>>>>
>>>>>>          if ((reg * 4) < adev->rio_mem_size)
>>>>>> @@ -497,7 +525,7 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
>>>>>>   */
>>>>>>  void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>>>>>>  {
>>>>>> -        if (adev->in_pci_err_recovery)
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>>                  return;
>>>>>>
>>>>>>          if ((reg * 4) < adev->rio_mem_size)
>>>>>> @@ -519,7 +547,7 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
>>>>>>   */
>>>>>>  u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>>>>>>  {
>>>>>> -        if (adev->in_pci_err_recovery)
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>>                  return 0;
>>>>>>
>>>>>>          if (index < adev->doorbell.num_doorbells) {
>>>>>> @@ -542,7 +570,7 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>>>>>>   */
>>>>>>  void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>>>>>>  {
>>>>>> -        if (adev->in_pci_err_recovery)
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>>                  return;
>>>>>>
>>>>>>          if (index < adev->doorbell.num_doorbells) {
>>>>>> @@ -563,7 +591,7 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>>>>>>   */
>>>>>>  u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>>>>>>  {
>>>>>> -        if (adev->in_pci_err_recovery)
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>>                  return 0;
>>>>>>
>>>>>>          if (index < adev->doorbell.num_doorbells) {
>>>>>> @@ -586,7 +614,7 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>>>>>>   */
>>>>>>  void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
>>>>>>  {
>>>>>> -        if (adev->in_pci_err_recovery)
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>>                  return;
>>>>>>
>>>>>>          if (index < adev->doorbell.num_doorbells) {
>>>>>> @@ -610,10 +638,13 @@ u32 amdgpu_device_indirect_rreg(struct amdgpu_device *adev,
>>>>>>                                  u32 pcie_index, u32 pcie_data,
>>>>>>                                  u32 reg_addr)
>>>>>>  {
>>>>>> -        unsigned long flags;
>>>>>> -        u32 r;
>>>>>>          void __iomem *pcie_index_offset;
>>>>>>          void __iomem *pcie_data_offset;
>>>>>> +        unsigned long flags;
>>>>>> +        u32 r;
>>>>>> +
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>> +                return 0;
>>>>>>
>>>>>>          spin_lock_irqsave(&adev->pcie_idx_lock, flags);
>>>>>>          pcie_index_offset = (void __iomem *)adev->rmmio + pcie_index * 4;
>>>>>> @@ -641,10 +672,13 @@ u64 amdgpu_device_indirect_rreg64(struct amdgpu_device *adev,
>>>>>>                                    u32 pcie_index, u32 pcie_data,
>>>>>>                                    u32 reg_addr)
>>>>>>  {
>>>>>> -        unsigned long flags;
>>>>>> -        u64 r;
>>>>>>          void __iomem *pcie_index_offset;
>>>>>>          void __iomem *pcie_data_offset;
>>>>>> +        unsigned long flags;
>>>>>> +        u64 r;
>>>>>> +
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>> +                return 0;
>>>>>>
>>>>>>          spin_lock_irqsave(&adev->pcie_idx_lock, flags);
>>>>>>          pcie_index_offset = (void __iomem *)adev->rmmio + pcie_index * 4;
>>>>>> @@ -677,9 +711,12 @@ void amdgpu_device_indirect_wreg(struct amdgpu_device *adev,
>>>>>>                                   u32 pcie_index, u32 pcie_data,
>>>>>>                                   u32 reg_addr, u32 reg_data)
>>>>>>  {
>>>>>> -        unsigned long flags;
>>>>>>          void __iomem *pcie_index_offset;
>>>>>>          void __iomem *pcie_data_offset;
>>>>>> +        unsigned long flags;
>>>>>> +
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>> +                return;
>>>>>>
>>>>>>          spin_lock_irqsave(&adev->pcie_idx_lock, flags);
>>>>>>          pcie_index_offset = (void __iomem *)adev->rmmio + pcie_index * 4;
>>>>>> @@ -706,9 +743,12 @@ void amdgpu_device_indirect_wreg64(struct amdgpu_device *adev,
>>>>>>                                     u32 pcie_index, u32 pcie_data,
>>>>>>                                     u32 reg_addr, u64 reg_data)
>>>>>>  {
>>>>>> -        unsigned long flags;
>>>>>>          void __iomem *pcie_index_offset;
>>>>>>          void __iomem *pcie_data_offset;
>>>>>> +        unsigned long flags;
>>>>>> +
>>>>>> +        if (amdgpu_device_skip_hw_access(adev))
>>>>>> +                return;
>>>>>>
>>>>>>          spin_lock_irqsave(&adev->pcie_idx_lock, flags);
>>>>>>          pcie_index_offset = (void __iomem *)adev->rmmio + pcie_index * 4;
>>>>>> --
>>>>>> 2.25.1
>>>>>>
>>>>>
>>>
>