[PATCH v3 1/3] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault

Thu Jan 23 20:29:38 UTC 2025

On Thu, Jan 23, 2025 at 2:26 PM Prakash Gupta <quic_guptap at quicinc.com> wrote:
>
> On Thu, Jan 23, 2025 at 09:00:17AM -0500, Connor Abbott wrote:
> > On Thu, Jan 23, 2025 at 6:10 AM Prakash Gupta <quic_guptap at quicinc.com> wrote:
> > >
> > > On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote:
> > >
> > > > +     /*
> > > > +      * On some implementations FSR.SS asserts a context fault
> > > > +      * interrupt. We do not want this behavior, because resolving the
> > > > +      * original context fault typically requires operations that cannot be
> > > > +      * performed in IRQ context but leaving the stall unacknowledged will
> > > > +      * immediately lead to another spurious interrupt as FSR.SS is still
> > > > +      * set. Work around this by disabling interrupts for this context bank.
> > > > +      * It's expected that interrupts are re-enabled after resuming the
> > > > +      * translation.
> > > > +      *
> > > > +      * We have to do this before report_iommu_fault() so that we don't
> > > > +      * leave interrupts disabled in case the downstream user decides the
> > > > +      * fault can be resolved inside its fault handler.
> > > > +      *
> > > > +      * There is a possible race if there are multiple context banks sharing
> > > > +      * the same interrupt and both signal an interrupt in between writing
> > > > +      * RESUME and SCTLR. We could disable interrupts here before we
> Not sure if multiple context bank with shared interrupt supported for
> arm-smmu driver, but even if does than context fault handler they would
> operate on their respective context register space, so I don't see the race
> at context register update.
>
> > > > +      * re-enable them in the resume handler, leaving interrupts enabled.
> > > > +      * Lock the write to serialize it with the resume handler.
> > > > +      */
> > > > +     if (cfi.fsr & ARM_SMMU_CB_FSR_SS) {
> > > > +             u32 val;
> > > > +
> > > > +             spin_lock(&smmu_domain->cb_lock);
> > > > +             val = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_SCTLR);
> > > > +             val &= ~ARM_SMMU_SCTLR_CFIE;
> > > > +             arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, val);
> > > > +             spin_unlock(&smmu_domain->cb_lock);
> > > > +     }
> > > > +
> > > > +     /*
> > > > +      * The SMMUv2 architecture specification says that if stall-on-fault is
> > > > +      * enabled the correct sequence is to write to SMMU_CBn_FSR to clear
> > > > +      * the fault and then write to SMMU_CBn_RESUME. Clear the interrupt
> > > > +      * first before running the user's fault handler to make sure we follow
> > > > +      * this sequence. It should be ok if there is another fault in the
> > > > +      * meantime because we have already read the fault info.
> > > > +      */
> qcom_adreno_smmu_get_fault_info() reads the fault info as part of client
> fault hanlder. So it would not be ok to clear FSR before reporting the
> fault to client.

That's a good point, but as long as stall-on-fault is enabled it
doesn't matter because subsequent transactions that fault will be
stalled. Patch 3 of this series disables stall-on-fault after the
first fault in drm/msm, but we don't care as much about the accuracy
of those subsequent faults.

>
> > > The context would remain stalled till we write to CBn_RESUME. Which is done
> > > in qcom_adreno_smmu_resume_translation(). For a stalled context further
> > > transactions are not processed and we shouldn't see further faults and
> > > or fault inerrupts. Do you observe faults with stalled context?
> >
> > Yes. I've observed that on MMU-500 writing RESUME before the interrupt
> > has been cleared doesn't clear SS. This happened with v2 in the case
> > where there was already a devcoredump and drm/msm called
> > qcom_adreno_smmu_resume_translation() immediately from its fault
> > handler, and we'd get a storm of unhandled interrupts until it was
> > disabled. Given that the architecture spec says we're supposed to
> > clear the interrupt first this may have been an attempt to "help"
> > developers.
> >
>
> Just to clarify, present sequence is:
> 1. context fault with stall enable. FSR.SS set.
> 2. Report fault to client
> 3. resume/terminate stalled transaction
> 4. clear FSR
>
> At what point when you try #2->#3->#4 or #4->#2->#3 sequence, is FSR.SS
> cleared and interrupt storm is observed.  The way CFIE disable is helping
> with current patch indicates write FSR is unstalling context and subsequent
> transactions are faulted.  Do you stop getting interrupt storm after write
> RESUME. If you can mention your SCTLR configuration and FSR state in test
> sequence, it would be clearer.
>
> An aside, If reducing delay between FSR and RESUME write helps then both
> can be done as part of qcom_adreno_smmu_resume_translation(). This will
> follow spec recommendation and also make sure fault registers are not
> cleared before reporting fault to client.
>
> Thanks,
> Prakash