[PATCH 3/3] drm/amdkfd: add RAS poison consumption support for utcl2

Zhou1, Tao Tao.Zhou1 at amd.com
Tue Mar 15 07:26:43 UTC 2022



> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> Sent: Tuesday, March 15, 2022 2:41 AM
> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Zhang,
> Hawking <Hawking.Zhang at amd.com>; Yang, Stanley
> <Stanley.Yang at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>
> Subject: Re: [PATCH 3/3] drm/amdkfd: add RAS poison consumption support for
> utcl2
> 
> I'm not sure I understand this change. It looks like you will check the RAS poison
> status on every UTCL2 VM fault? Is that because there is no dedicated interrupt
> source or client ID to distinguish UTCL2 poison consumption from VM faults?
> 
> Why is kfd_signal_poison_consumed_event not done for UTCL2 poison
> consumption? The comment says, because it's signaled to user mode as a VM
> fault. But a VM fault and a poison consumption event are different.
> You're basically sending the wrong event to user mode. Maybe it doesn't make a
> big difference in practice at the moment. But that could change in the future.

[Tao] Yes, there is no dedicated interrupt source, so I need to check the FED bit in VM_L2_PROTECTION_FAULT_STATUS to distinguish UTCL2 poison consumption from a general VM fault.

I injected an error into PDE1 to trigger UTCL2 poison consumption. In my experiment, ROCr can handle both kfd_signal_poison_consumed_event and kfd_signal_vm_fault_event (bus error vs. coredump), and the application can be killed in either case.
I think we can treat UTCL2 poison consumption as a special kind of VM fault, so I chose kfd_signal_vm_fault_event. Anyway, this can be discussed further.
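
A minimal sketch of the FED-bit check described above, for illustration only. The mask value and every example_* name are assumptions made for this sketch; the real field layout of VM_L2_PROTECTION_FAULT_STATUS comes from the register headers and is not shown in this thread.

```c
#include <stdint.h>

/* Hypothetical bit position, chosen only for this sketch. The real FED
 * field of VM_L2_PROTECTION_FAULT_STATUS is defined in the register headers.
 */
#define EXAMPLE_FAULT_STATUS_FED_MASK (1u << 29)

enum example_fault_kind {
	EXAMPLE_FAULT_VM,	/* ordinary VM fault */
	EXAMPLE_FAULT_POISON,	/* UTCL2 poison consumption */
};

/* Both cases arrive through the same UTCL2 interrupt, so the only way to
 * tell them apart is the FED bit in the fault status value.
 */
static enum example_fault_kind example_classify_utcl2_fault(uint32_t fault_status)
{
	if (fault_status & EXAMPLE_FAULT_STATUS_FED_MASK)
		return EXAMPLE_FAULT_POISON;

	return EXAMPLE_FAULT_VM;
}
```

In the patch this decision sits behind the is_ras_utcl2_poison callback, so the fault handler itself never reads the register.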

> 
> Two more comments inline ...
> 
> 
> Am 2022-03-14 um 03:03 schrieb Tao Zhou:
> > Do RAS page retirement and use gpu reset as fallback in utcl2 fault
> > handler.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
> > ---
> >   .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c   | 23 ++++++++++++++++---
> >   1 file changed, 20 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> > index f7def0bf0730..3991f71d865b 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> > @@ -93,11 +93,12 @@ enum SQ_INTERRUPT_ERROR_TYPE {
> >   static void event_interrupt_poison_consumption(struct kfd_dev *dev,
> >   				const uint32_t *ih_ring_entry)
> >   {
> > -	uint16_t source_id, pasid;
> > +	uint16_t source_id, client_id, pasid;
> >   	int ret = -EINVAL;
> >   	struct kfd_process *p;
> >
> >   	source_id = SOC15_SOURCE_ID_FROM_IH_ENTRY(ih_ring_entry);
> > +	client_id = SOC15_CLIENT_ID_FROM_IH_ENTRY(ih_ring_entry);
> >   	pasid = SOC15_PASID_FROM_IH_ENTRY(ih_ring_entry);
> >
> >   	p = kfd_lookup_process_by_pasid(pasid);
> > @@ -110,6 +111,7 @@ static void event_interrupt_poison_consumption(struct kfd_dev *dev,
> >   		return;
> >   	}
> >
> > +	pr_debug("RAS poison consumption handling\n");
> >   	atomic_set(&p->poison, 1);
> >   	kfd_unref_process(p);
> >
> > @@ -119,10 +121,14 @@ static void event_interrupt_poison_consumption(struct kfd_dev *dev,
> >   		break;
> >   	case SOC15_INTSRC_SDMA_ECC:
> >   	default:
> > +		if (client_id == SOC15_IH_CLIENTID_UTCL2)
> > +			ret = kfd_dqm_evict_pasid(dev->dqm, pasid);
> >   		break;
> >   	}
> >
> > -	kfd_signal_poison_consumed_event(dev, pasid);
> > +	/* utcl2 page fault has its own vm fault event */
> > +	if (client_id != SOC15_IH_CLIENTID_UTCL2)
> > +		kfd_signal_poison_consumed_event(dev, pasid);
> >
> >   	/* resetting queue passes, do page retirement without gpu reset
> >   	 * resetting queue fails, fallback to gpu reset solution
> > @@ -314,7 +320,18 @@ static void event_interrupt_wq_v9(struct kfd_dev *dev,
> >   		info.prot_write = ring_id & 0x20;
> >
> >   		kfd_smi_event_update_vmfault(dev, pasid);
> > -		kfd_dqm_evict_pasid(dev->dqm, pasid);
> > +
> > +		if (client_id == SOC15_IH_CLIENTID_UTCL2 &&
> > +		    dev->kfd2kgd->is_ras_utcl2_poison &&
> > +		    dev->kfd2kgd->is_ras_utcl2_poison(dev->adev, client_id)) {
> > +			event_interrupt_poison_consumption(dev, ih_ring_entry);
> > +
> > +			if (dev->kfd2kgd->utcl2_fault_clear)
> > +				dev->kfd2kgd->utcl2_fault_clear(dev->adev);
> > +		}
> > +		else
> 
> Else should be on the same line as }. Please run checkpatch.pl, it would catch this.
> It's also good practice to use {}-braces around the else-branch if you have
> braces around the if-branch.

[Tao] Accepted.

> 
> 
> > +			kfd_dqm_evict_pasid(dev->dqm, pasid);
> > +
> >   		kfd_signal_vm_fault_event(dev, pasid, &info);
> 
> If you move kfd_signal_vm_fault_event into the else-branch, you can
> unconditionally send the correct poison consumption event to user mode in
> event_interrupt_poison_consumption.
> 
> Regards,
>    Felix
> 
> 
> >   	}
> >   }

