[PATCH v3 1/3] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault
Connor Abbott
cwabbott0 at gmail.com
Thu Jan 23 18:09:33 UTC 2025
On Thu, Jan 23, 2025 at 12:35 PM Prakash Gupta <quic_guptap at quicinc.com> wrote:
>
> On Thu, Jan 23, 2025 at 11:51:27AM +0000, Robin Murphy wrote:
> > On 2025-01-23 11:10 am, Prakash Gupta wrote:
> > > On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote:
> > > > + /*
> > > > + * The SMMUv2 architecture specification says that if stall-on-fault is
> > > > + * enabled the correct sequence is to write to SMMU_CBn_FSR to clear
> > > > + * the fault and then write to SMMU_CBn_RESUME. Clear the interrupt
> > > > + * first before running the user's fault handler to make sure we follow
> > > > + * this sequence. It should be ok if there is another fault in the
> > > > + * meantime because we have already read the fault info.
> > > > + */
> > > The context would remain stalled till we write to CBn_RESUME, which is done
> > > in qcom_adreno_smmu_resume_translation(). For a stalled context further
> > > transactions are not processed and we shouldn't see further faults and/or
> > > fault interrupts. Do you observe faults with a stalled context?
> >
> > This aspect isn't exclusive to stalled contexts though - even for "normal"
> > terminated faults, clearing the FSR as soon as we've sampled all the
> > associated fault registers is no bad thing, since if a second fault does
> > occur while we're still reporting the first, we're then more likely to get a
> > full syndrome rather than just the FSR.MULTI bit.
> >
> ARM SMMUv2 spec recommends that, in case of a reported fault, sw should first
> correct the condition which caused the fault. I would interpret this as
> reporting the fault to the client using the callback, and then writing CBn_FSR
> and CBn_RESUME, in this order. Even for a reported fault where the context is
> not stalled, IMO I see no reason why the first step should be any
> different. I agree that delaying fault clearance can result in FSR.MULTI
> being set, but clearing the fault beforehand prevents clients from using
> SCTLR.HUPCF on subsequent transactions while they take any debug action. The
> client should be reported the fault in the same state in which it occurred.
> Please refer to qcom_smmu_context_fault() for this sequence.
It's not possible to implement it the way the spec describes because
correcting the condition which caused the fault cannot always be done
in the client's callback. Sometimes it has to be deferred to a handler
not in IRQ context. However, we must clear FSR (except for the SS bit)
before leaving the fault handler because, while the transaction is
pending, we want to be able to distinguish between a subsequent fault
in another context bank (only FSR.SS will be set here, so we can ignore
it and return IRQ_NONE) and one in this context bank if they share an
interrupt line. So, moving this up generally doesn't hurt and fixes the case
where the client does resume the transaction inside its handler. I
don't think there's really another way to implement this.
And I have no idea why you think this prevents clients from using
HUPCF, we already use HUPCF in drm/msm and it works fine with this
series.
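
As a rough illustration of the ordering argued for above, here is a toy
model in plain C. It is not the driver code: every sim_* name and the bit
positions are made up for the example, and the only behaviour assumed is
what this thread states, namely that fault bits are cleared by writing
them back to FSR while SS is only cleared by a write to RESUME.

```c
#include <stdint.h>

/* Hypothetical bit layout for a simulated FSR (illustration only). */
#define SIM_FSR_TF    (1u << 1)
#define SIM_FSR_SS    (1u << 30)
#define SIM_FSR_MULTI (1u << 31)
/* Mask of "real" fault bits; SS is deliberately left out. */
#define SIM_FSR_FAULT (SIM_FSR_MULTI | SIM_FSR_TF)

enum sim_irqreturn { SIM_IRQ_NONE, SIM_IRQ_HANDLED };

/* One simulated context bank. */
struct sim_cb {
	uint32_t fsr;
};

/* Client callback type; the callback may resume now or defer it. */
typedef void (*sim_fault_cb)(struct sim_cb *cb);

/* Model of an FSR write: fault bits are write-one-to-clear, SS is untouched. */
static void sim_fsr_write(struct sim_cb *cb, uint32_t val)
{
	cb->fsr &= ~(val & ~SIM_FSR_SS);
}

/* Model of a RESUME write: ends the stall, which clears SS. */
static void sim_resume(struct sim_cb *cb)
{
	cb->fsr &= ~SIM_FSR_SS;
}

static enum sim_irqreturn sim_context_fault(struct sim_cb *cb,
					    sim_fault_cb report)
{
	uint32_t fsr = cb->fsr;	/* sample the fault info first */

	/* Only SS (or nothing) set: not a new fault in this bank. */
	if (!(fsr & SIM_FSR_FAULT))
		return SIM_IRQ_NONE;

	/* Clear the fault bits before the client runs, so FSR then RESUME. */
	sim_fsr_write(cb, fsr);

	if (report)
		report(cb);

	return SIM_IRQ_HANDLED;
}
```

With a deferred resume, a second pass through sim_context_fault() for the
same bank sees only SS and returns SIM_IRQ_NONE, which is the shared
interrupt line case. With a callback that resumes immediately, the FSR
write has already happened by the time RESUME is written.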
>
> > > > + arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr);
> > > > +
> > > > ret = report_iommu_fault(&smmu_domain->domain, NULL, cfi.iova,
> > > > cfi.fsynr & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
> > > > if (ret == -ENOSYS && __ratelimit(&rs))
> > > > arm_smmu_print_context_fault_info(smmu, idx, &cfi);
> > > > - arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr);
> > > > return IRQ_HANDLED;
> > > > }
> > > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > > > index 2dbf3243b5ad2db01e17fb26c26c838942a491be..789c64ff3eb9944c8af37426e005241a8288da20 100644
> > > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > > > @@ -216,7 +216,6 @@ enum arm_smmu_cbar_type {
> > > > ARM_SMMU_CB_FSR_TLBLKF)
> > > > #define ARM_SMMU_CB_FSR_FAULT (ARM_SMMU_CB_FSR_MULTI | \
> > > > - ARM_SMMU_CB_FSR_SS | \
> > > Given that writing to FSR.SS doesn't clear this bit but a write to
> > > CBn_RESUME does, this seems right. This bit can be taken as a separate patch.
> >
> > This change on its own isn't really useful - all that would achieve is that
> > instead of constantly re-reporting the FSR.SS "fault", the interrupt goes
> > unhandled and the IRQ core ends up disabling it permanently. If anything
> > that's arguably worse, since the storm of context fault reports does at
> > least give a fairly clear indication of what's gone wrong, rather than
> > having to deduce the cause of an "irq n: nobody cared" message entirely by
> > code inspection.
> >
> Does the spec allow, or do we see, a reported fault with just the FSR.SS bit?
Yes, the spec allows it and we do see it in practice. After this patch
we may still see it in the case where multiple context banks share an
interrupt and the other bank faults, but the correct action is to
ignore it because we've disabled CFIE for this bank already. This is
why this hunk has to be in this patch.
> If the answer
> is no, then keeping FSR_SS would be misleading. Here ARM_SMMU_CB_FSR_FAULT
> is used to clear fault bits or check for valid faults. Also, the validity of
> this does not depend on the rest of the change.
>
> Thanks,
> Prakash
>
> > >
> > > > ARM_SMMU_CB_FSR_UUT | \
> > > > ARM_SMMU_CB_FSR_EF | \
> > > > ARM_SMMU_CB_FSR_PF | \
> > > >
> > > > --
> > > > 2.47.1
> > > >
> >