[PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault

Thu May 15 14:46:56 UTC 2025

On Tue, May 06, 2025 at 11:18:44AM -0400, Connor Abbott wrote:
> On Tue, May 6, 2025 at 10:53 AM Will Deacon <will at kernel.org> wrote:
> >
> > On Tue, May 06, 2025 at 10:08:05AM -0400, Connor Abbott wrote:
> > > On Tue, May 6, 2025 at 8:24 AM Will Deacon <will at kernel.org> wrote:
> > > > On Wed, Mar 19, 2025 at 10:44:02AM -0400, Connor Abbott wrote:
> > > > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > > index c7b5d7c093e71050d29a834c8d33125e96b04d81..9927f3431a2eab913750e6079edc6393d1938c98 100644
> > > > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > > @@ -470,13 +470,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> > > > >       if (!(cfi->fsr & ARM_SMMU_CB_FSR_FAULT))
> > > > >               return IRQ_NONE;
> > > > >
> > > > > +     /*
> > > > > +      * On some implementations FSR.SS asserts a context fault
> > > > > +      * interrupt. We do not want this behavior, because resolving the
> > > > > +      * original context fault typically requires operations that cannot be
> > > > > +      * performed in IRQ context but leaving the stall unacknowledged will
> > > > > +      * immediately lead to another spurious interrupt as FSR.SS is still
> > > > > +      * set. Work around this by disabling interrupts for this context bank.
> > > > > +      * It's expected that interrupts are re-enabled after resuming the
> > > > > +      * translation.
> > > >
> > > > s/translation/transaction/
> > > >
> > > > > +      *
> > > > > +      * We have to do this before report_iommu_fault() so that we don't
> > > > > +      * leave interrupts disabled in case the downstream user decides the
> > > > > +      * fault can be resolved inside its fault handler.
> > > > > +      *
> > > > > +      * There is a possible race if there are multiple context banks sharing
> > > > > +      * the same interrupt and both signal an interrupt in between writing
> > > > > +      * RESUME and SCTLR. We could disable interrupts here before we
> > > > > +      * re-enable them in the resume handler, leaving interrupts enabled.
> > > > > +      * Lock the write to serialize it with the resume handler.
> > > > > +      */
> > > >
> > > > I'm struggling to understand this last part. If the resume handler runs
> > > > synchronously from report_iommu_fault(), then there's no need for
> > > > locking because we're in interrupt context. If the resume handler can
> > > > run asynchronously from report_iommu_fault(), then the locking doesn't
> > > > help because the code below could clear CFIE right after the resume
> > > > handler has set it.
> > >
> > > The problem is indeed when the resume handler runs asynchronously.
> > > Clearing CFIE right after the resume handler has set it is normal and
> > > expected. The issue is the opposite, i.e. something like:
> > >
> > > - Resume handler writes RESUME and stalls for some reason
> > > - The interrupt handler runs through and clears CFIE while it's already cleared
> > > - Resume handler sets CFIE, assuming that the handler hasn't run yet
> > > but it actually has
> > >
> > > This wouldn't happen with only one context bank, because we wouldn't
> > > get an interrupt until the resume handler sets CFIE, but with multiple
> > > context banks and a shared interrupt line we could get a "spurious"
> > > interrupt due to a fault in an earlier context bank that becomes not
> > > spurious if the resume handler writes RESUME before the context fault
> > > handler for this bank reads FSR above.
> >
> > Ah, gotcha. Thanks for the explanation.
> >
> > If we moved the RESUME+CFIE into the interrupt handler after the call
> > to report_iommu_fault(), would it be possible to run the handler as a
> > threaded irq (see 'context_fault_needs_threaded_irq') and handle the
> > callback synchronously? In that case, I think we could avoid taking the
> > lock if we wrote CFIE _before_ RESUME.
> >
> 
> We need the lock anyway due to the parallel manipulation of CFCFG in
> the same register introduced in the next patch. Expanding it to also
> cover the write to RESUME is not a huge deal. Also, doing it
> synchronously would require rewriting the fault handling in drm/msm
> and again I'm trying to fix this serious stability problem now as soon
> as possible without getting dragged into rewriting the whole thing.

This has never worked though, right? In which case, we should fix it
properly rather than papering over the mess.

Georgi (CC'd) added support for threaded interrupts specifically to
permit sleeping operations in the fault handler. You should be able to
use that and I don't understand why that would require "rewriting the
whole thing". You can kick the async work and then wait for it to
complete, no?

That would then open the door to handling the RESUME in the core driver
in future based on the return value from report_iommu_fault().

You also need to fix qcom_tbu_halt() as I mentioned before.

Will