<!DOCTYPE html> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> </head> <body> Am 24.06.24 um 14:24 schrieb Lazar, Lijo: <blockquote type="cite" cite="mid:b572a287-4ce3-447a-a211-c5caf9ec09cb@amd.com"> <pre class="moz-quote-pre" wrap="">On 6/24/2024 5:31 PM, Christian König wrote: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">Am 24.06.24 um 13:57 schrieb Lazar, Lijo: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">On 6/24/2024 5:19 PM, Christian König wrote: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">Am 24.06.24 um 11:52 schrieb Lazar, Lijo: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">On 6/24/2024 3:08 PM, Christian König wrote: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">Am 24.06.24 um 08:34 schrieb Lazar, Lijo: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">On 6/24/2024 12:01 PM, Vignesh Chander wrote: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">correctly handle the case when trylock fails when gpu is about to be reset by dropping the request instead of using mmio Signed-off-by: Vignesh Chander <a class="moz-txt-link-rfc2396E" href="mailto:Vignesh.Chander@amd.com"><Vignesh.Chander@amd.com></a> </pre> </blockquote> <pre class="moz-quote-pre" wrap="">Reviewed-by: Lijo Lazar <a class="moz-txt-link-rfc2396E" href="mailto:lijo.lazar@amd.com"><lijo.lazar@amd.com></a> Thanks, Lijo </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">--- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++---------- 1 file changed, 21 insertions(+), 17 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index a7ce8280b17ce7..3cfd24699d691d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -613,10 +613,11 @@ uint32_t amdgpu_device_rreg(struct amdgpu_device *adev, if ((reg * 4) < adev->rmmio_size) { if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && - amdgpu_sriov_runtime(adev) && - down_read_trylock(&adev->reset_domain->sem)) { - ret = amdgpu_kiq_rreg(adev, reg, 0); - up_read(&adev->reset_domain->sem); + amdgpu_sriov_runtime(adev) { + if (down_read_trylock(&adev->reset_domain->sem)) { + ret = amdgpu_kiq_rreg(adev, reg, 0); + up_read(&adev->reset_domain->sem); + } </pre> </blockquote> </blockquote> <pre class="moz-quote-pre" wrap="">What has actually changed here? As far as I can see that isn't functionally different to the existing code. </pre> </blockquote> <pre class="moz-quote-pre" wrap="">In earlier logic, if it fails to get the lock, it takes the 'else' path. The 'else' path is not meant for sriov/vf. </pre> </blockquote> <pre class="moz-quote-pre" wrap="">Yeah, but the read or write is then just silently dropped. That can't be correct either. </pre> </blockquote> <pre class="moz-quote-pre" wrap="">These are void funcs. Moreover, the drops will happen if there is concurrent access from another thread while a reset is going on. That is expected and those accesses during a reset won't help anyway. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> Nope, Teddy has been working on that for a while as well. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> This silent drop is already there in bare metal. <a class="moz-txt-link-freetext" href="https://elixir.bootlin.com/linux/v6.10-rc5/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L738">https://elixir.bootlin.com/linux/v6.10-rc5/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L738</a></pre> </blockquote> That's not for reset, but for hotplug. In that case the silent drop is ok because the device is gone anyway. <blockquote type="cite" cite="mid:b572a287-4ce3-447a-a211-c5caf9ec09cb@amd.com"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> Trying to make those accesses while the reset is going on is wrong in the first place. What we need to do is to grab the reset lock in the higher level function so that the whole sequences of reads and writes are protected. What this logic here does is to use readl()/writel() from the reset thread itself. Dropping that is incorrect and could lead to broken reset. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> This doesn't change anything for a reset thread. This fixes an already broken path for sriov where it attempts a direct readl()/writel() if taking the lock fails. That is even more broken.</pre> </blockquote> No, as far as I know that's correct behavior. The lock is there to make sure we can talk to the KIQ. When that isn't possible because of reset we should use direct register access, that's completely intentional behavior. What we could do is to remove the down_read_trylock() from amdgpu_device_skip_hw_access() and always assert that the reset domain lock is held. Then we should also remove the down_read_trylock() from amdgpu_device_[rw]reg() and replace it with !amdgpu_in_reset() as it used to be. This way we will get proper lockdep warnings for all the call chains missing to grab the reset lock. Regards, Christian. <blockquote type="cite" cite="mid:b572a287-4ce3-447a-a211-c5caf9ec09cb@amd.com"> <pre class="moz-quote-pre" wrap=""> Thanks, Lijo </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> So clear NAK from my side to this patch here. Regards, Christian. </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> Thanks, Lijo </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">Regards, Christian. </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">Thanks, Lijo </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">Regards, Christian. </pre> <blockquote type="cite"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> } else { ret = readl(((void __iomem *)adev->rmmio) + (reg * 4)); } @@ -681,10 +682,11 @@ uint32_t amdgpu_device_xcc_rreg(struct amdgpu_device *adev, &rlcg_flag)) { ret = amdgpu_virt_rlcg_reg_rw(adev, reg, 0, rlcg_flag, GET_INST(GC, xcc_id)); } else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && - amdgpu_sriov_runtime(adev) && - down_read_trylock(&adev->reset_domain->sem)) { - ret = amdgpu_kiq_rreg(adev, reg, xcc_id); - up_read(&adev->reset_domain->sem); + amdgpu_sriov_runtime(adev) { + if (down_read_trylock(&adev->reset_domain->sem)) { + ret = amdgpu_kiq_rreg(adev, reg, xcc_id); + up_read(&adev->reset_domain->sem); + } } else { ret = readl(((void __iomem *)adev->rmmio) + (reg * 4)); } @@ -740,10 +742,11 @@ void amdgpu_device_wreg(struct amdgpu_device *adev, if ((reg * 4) < adev->rmmio_size) { if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && - amdgpu_sriov_runtime(adev) && - down_read_trylock(&adev->reset_domain->sem)) { - amdgpu_kiq_wreg(adev, reg, v, 0); - up_read(&adev->reset_domain->sem); + amdgpu_sriov_runtime(adev) { + if (down_read_trylock(&adev->reset_domain->sem)) { + amdgpu_kiq_wreg(adev, reg, v, 0); + up_read(&adev->reset_domain->sem); + } } else { writel(v, ((void __iomem *)adev->rmmio) + (reg * 4)); } @@ -812,11 +815,12 @@ void amdgpu_device_xcc_wreg(struct amdgpu_device *adev, &rlcg_flag)) { amdgpu_virt_rlcg_reg_rw(adev, reg, v, rlcg_flag, GET_INST(GC, xcc_id)); } else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && - amdgpu_sriov_runtime(adev) && - down_read_trylock(&adev->reset_domain->sem)) { - amdgpu_kiq_wreg(adev, reg, v, xcc_id); - up_read(&adev->reset_domain->sem); - } else { + amdgpu_sriov_runtime(adev) { + if (down_read_trylock(&adev->reset_domain->sem)) { + amdgpu_kiq_wreg(adev, reg, v, xcc_id); + up_read(&adev->reset_domain->sem); + } + } else { writel(v, ((void __iomem *)adev->rmmio) + (reg * 4)); } } else { </pre> </blockquote> </blockquote> </blockquote> </blockquote> </blockquote> </blockquote> <pre class="moz-quote-pre" wrap=""> </pre> </blockquote> </blockquote> </body> </html>