[PATCH v3 1/3] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault

Thu Jan 23 19:25:52 UTC 2025

On Thu, Jan 23, 2025 at 09:00:17AM -0500, Connor Abbott wrote:
> On Thu, Jan 23, 2025 at 6:10 AM Prakash Gupta <quic_guptap at quicinc.com> wrote:
> >
> > On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote:
> >
> > > +     /*
> > > +      * On some implementations FSR.SS asserts a context fault
> > > +      * interrupt. We do not want this behavior, because resolving the
> > > +      * original context fault typically requires operations that cannot be
> > > +      * performed in IRQ context but leaving the stall unacknowledged will
> > > +      * immediately lead to another spurious interrupt as FSR.SS is still
> > > +      * set. Work around this by disabling interrupts for this context bank.
> > > +      * It's expected that interrupts are re-enabled after resuming the
> > > +      * translation.
> > > +      *
> > > +      * We have to do this before report_iommu_fault() so that we don't
> > > +      * leave interrupts disabled in case the downstream user decides the
> > > +      * fault can be resolved inside its fault handler.
> > > +      *
> > > +      * There is a possible race if there are multiple context banks sharing
> > > +      * the same interrupt and both signal an interrupt in between writing
> > > +      * RESUME and SCTLR. We could disable interrupts here before we
Not sure if multiple context bank with shared interrupt supported for
arm-smmu driver, but even if does than context fault handler they would
operate on their respective context register space, so I don't see the race
at context register update. 

> > > +      * re-enable them in the resume handler, leaving interrupts enabled.
> > > +      * Lock the write to serialize it with the resume handler.
> > > +      */
> > > +     if (cfi.fsr & ARM_SMMU_CB_FSR_SS) {
> > > +             u32 val;
> > > +
> > > +             spin_lock(&smmu_domain->cb_lock);
> > > +             val = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_SCTLR);
> > > +             val &= ~ARM_SMMU_SCTLR_CFIE;
> > > +             arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, val);
> > > +             spin_unlock(&smmu_domain->cb_lock);
> > > +     }
> > > +
> > > +     /*
> > > +      * The SMMUv2 architecture specification says that if stall-on-fault is
> > > +      * enabled the correct sequence is to write to SMMU_CBn_FSR to clear
> > > +      * the fault and then write to SMMU_CBn_RESUME. Clear the interrupt
> > > +      * first before running the user's fault handler to make sure we follow
> > > +      * this sequence. It should be ok if there is another fault in the
> > > +      * meantime because we have already read the fault info.
> > > +      */
qcom_adreno_smmu_get_fault_info() reads the fault info as part of client
fault hanlder. So it would not be ok to clear FSR before reporting the
fault to client.

> > The context would remain stalled till we write to CBn_RESUME. Which is done
> > in qcom_adreno_smmu_resume_translation(). For a stalled context further
> > transactions are not processed and we shouldn't see further faults and
> > or fault inerrupts. Do you observe faults with stalled context?
> 
> Yes. I've observed that on MMU-500 writing RESUME before the interrupt
> has been cleared doesn't clear SS. This happened with v2 in the case
> where there was already a devcoredump and drm/msm called
> qcom_adreno_smmu_resume_translation() immediately from its fault
> handler, and we'd get a storm of unhandled interrupts until it was
> disabled. Given that the architecture spec says we're supposed to
> clear the interrupt first this may have been an attempt to "help"
> developers.
> 

Just to clarify, present sequence is:
1. context fault with stall enable. FSR.SS set.
2. Report fault to client
3. resume/terminate stalled transaction
4. clear FSR

At what point when you try #2->#3->#4 or #4->#2->#3 sequence, is FSR.SS
cleared and interrupt storm is observed.  The way CFIE disable is helping
with current patch indicates write FSR is unstalling context and subsequent
transactions are faulted.  Do you stop getting interrupt storm after write
RESUME. If you can mention your SCTLR configuration and FSR state in test
sequence, it would be clearer. 

An aside, If reducing delay between FSR and RESUME write helps then both
can be done as part of qcom_adreno_smmu_resume_translation(). This will
follow spec recommendation and also make sure fault registers are not
cleared before reporting fault to client.

Thanks,
Prakash