[PATCH v3 1/3] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault

Connor Abbott cwabbott0 at gmail.com
Thu Jan 23 14:00:17 UTC 2025


On Thu, Jan 23, 2025 at 6:10 AM Prakash Gupta <quic_guptap at quicinc.com> wrote:
>
> On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote:
>
> > @@ -125,12 +125,25 @@ static void qcom_adreno_smmu_resume_translation(const void *cookie, bool termina
> >       struct arm_smmu_domain *smmu_domain = (void *)cookie;
> >       struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> >       struct arm_smmu_device *smmu = smmu_domain->smmu;
> > -     u32 reg = 0;
> > +     u32 reg = 0, sctlr;
> > +     unsigned long flags;
> >
> >       if (terminate)
> >               reg |= ARM_SMMU_RESUME_TERMINATE;
> >
> > +     spin_lock_irqsave(&smmu_domain->cb_lock, flags);
> > +
> >       arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
> > +
> At this point further transaction can be processed but SCTLR.CFIE is
> cleared so subequent context fault will not generate interrupt till
> SCTLR.CFIE is set.

If you're asking why the spin lock is there, it's because this isn't
true if there's another context bank, they share an interrupt line,
and it happens to fault around the same time. I haven't checked if
that's actually the case for Adreno, but in case this gets used by
other drivers and moved into common code I want it to be as robust as
possible. This is explained in the comment added to
arm_smmu_context_fault(). Also the next commit toggles CFCFG and we
want to serialize against that.

>
> > +     /*
> > +      * Re-enable interrupts after they were disabled by
> > +      * arm_smmu_context_fault().
> > +      */
> > +     sctlr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR);
> > +     sctlr |= ARM_SMMU_SCTLR_CFIE;
> > +     arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR, sctlr);
> > +
> > +     spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
> >  }
> >
> >  static void qcom_adreno_smmu_set_prr_bit(const void *cookie, bool set)
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > index 79afc92e1d8b984dd35c469a3f283ad0c78f3d26..ca1ff59015a63912f0f9c5256452b2b2efa928f1 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > @@ -463,13 +463,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> >       if (!(cfi.fsr & ARM_SMMU_CB_FSR_FAULT))
> >               return IRQ_NONE;
> >
> > +     /*
> > +      * On some implementations FSR.SS asserts a context fault
> > +      * interrupt. We do not want this behavior, because resolving the
> > +      * original context fault typically requires operations that cannot be
> > +      * performed in IRQ context but leaving the stall unacknowledged will
> > +      * immediately lead to another spurious interrupt as FSR.SS is still
> > +      * set. Work around this by disabling interrupts for this context bank.
> > +      * It's expected that interrupts are re-enabled after resuming the
> > +      * translation.
> > +      *
> > +      * We have to do this before report_iommu_fault() so that we don't
> > +      * leave interrupts disabled in case the downstream user decides the
> > +      * fault can be resolved inside its fault handler.
> > +      *
> > +      * There is a possible race if there are multiple context banks sharing
> > +      * the same interrupt and both signal an interrupt in between writing
> > +      * RESUME and SCTLR. We could disable interrupts here before we
> > +      * re-enable them in the resume handler, leaving interrupts enabled.
> > +      * Lock the write to serialize it with the resume handler.
> > +      */
> > +     if (cfi.fsr & ARM_SMMU_CB_FSR_SS) {
> > +             u32 val;
> > +
> > +             spin_lock(&smmu_domain->cb_lock);
> > +             val = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_SCTLR);
> > +             val &= ~ARM_SMMU_SCTLR_CFIE;
> > +             arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, val);
> > +             spin_unlock(&smmu_domain->cb_lock);
> > +     }
> > +
> > +     /*
> > +      * The SMMUv2 architecture specification says that if stall-on-fault is
> > +      * enabled the correct sequence is to write to SMMU_CBn_FSR to clear
> > +      * the fault and then write to SMMU_CBn_RESUME. Clear the interrupt
> > +      * first before running the user's fault handler to make sure we follow
> > +      * this sequence. It should be ok if there is another fault in the
> > +      * meantime because we have already read the fault info.
> > +      */
> The context would remain stalled till we write to CBn_RESUME. Which is done
> in qcom_adreno_smmu_resume_translation(). For a stalled context further
> transactions are not processed and we shouldn't see further faults and
> or fault inerrupts. Do you observe faults with stalled context?

Yes. I've observed that on MMU-500 writing RESUME before the interrupt
has been cleared doesn't clear SS. This happened with v2 in the case
where there was already a devcoredump and drm/msm called
qcom_adreno_smmu_resume_translation() immediately from its fault
handler, and we'd get a storm of unhandled interrupts until it was
disabled. Given that the architecture spec says we're supposed to
clear the interrupt first this may have been an attempt to "help"
developers.

>
> > +     arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr);
> > +
> >       ret = report_iommu_fault(&smmu_domain->domain, NULL, cfi.iova,
> >               cfi.fsynr & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
> >
> >       if (ret == -ENOSYS && __ratelimit(&rs))
> >               arm_smmu_print_context_fault_info(smmu, idx, &cfi);
> >
> > -     arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr);
> >       return IRQ_HANDLED;
> >  }
> >
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > index 2dbf3243b5ad2db01e17fb26c26c838942a491be..789c64ff3eb9944c8af37426e005241a8288da20 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > @@ -216,7 +216,6 @@ enum arm_smmu_cbar_type {
> >                                        ARM_SMMU_CB_FSR_TLBLKF)
> >
> >  #define ARM_SMMU_CB_FSR_FAULT                (ARM_SMMU_CB_FSR_MULTI |        \
> > -                                      ARM_SMMU_CB_FSR_SS |           \
> Given writing to FSR.SS doesn't clear this bit but write to CBn_RESUME
> does, this seems right. This but can be taken as separate patch.
>
> >                                        ARM_SMMU_CB_FSR_UUT |          \
> >                                        ARM_SMMU_CB_FSR_EF |           \
> >                                        ARM_SMMU_CB_FSR_PF |           \
> >
> > --
> > 2.47.1
> >


More information about the Freedreno mailing list