[PATCH] drm/amdkfd: force raven as "dgpu" path

Tue Aug 11 11:45:54 UTC 2020

On Fri, Aug 07, 2020 at 11:00:38PM +0800, Kuehling, Felix wrote:
> Am 2020-08-07 um 4:25 a.m. schrieb Huang Rui:
> > We still have a few iommu issues which need to address, so force raven
> > as "dgpu" path for the moment.
> >
> > Will enable IOMMUv2 since the issues are fixed.
> 
> Do you mean "_when_ the issues are fixed"?

Yes.

> The current iommuv2 troubles aside, I think this change breaks existing
> user mode. The existing Thunk is not prepared to see Raven as a dGPU. So
> this is not something we want to do in an upstream Linux kernel change.

Do you mean it may break the thunk without setting "is_dgpu" flag in the
hsa_gfxip_table for raven?

> 
> I suggest using the ignore_crat module parameter for the workaround
> instead. You may need to duplicate the raven_device_info and pick the
> right one depending on whether it is a dGPU or an APU. The only
> difference would be the need_iommu_device option. If ignore_crat is set,
> you can support Raven as a dGPU and require a corresponding Thunk change
> that conditionally support Raven as a dGPU.
> 
> I think such a change would also be the right direction for supporting
> Raven more universally in the future. It can be extended to
> conditionally treat Raven as a dGPU automatically in some situations:
> 
>   * broken or missing CRAT table
>   * IOMMUv2 disabled
> 
> Those are all situations where the current driver is broken anyway (and
> always has been), so it would not be a kernel change that breaks
> existing user mode.

OK, I think I understand it. We should use a parameter/flag such as
ignore_crat or something else to inform the user mode which path we should
go, treat it as dgpu or apu. Then thunk can detect the flag from kernel to
know how to handle the case. Am I right?

> 
> In addition the Thunk could be changed to downgrade a Raven APU to dGPU
> (by splitting the APU node into a separate CPU and dGPU node) if other
> dGPUs are present in the systems to disable all the APU-specific code
> paths and allow all the GPUs to work together seamlessly with SVM.

Thanks! Let me look at the thunk again.

For now, based on your comments, I would like to update patch as:

- Modify the ignore_crat parameter as tristate: "use", "ignore", and
  "auto". Usually, by default is "auto = use", but in some special case
  such as iommu v2 broken or no iommu v2 support yet (Renoir), we have to
  set "auto = ignore". Then treat it as "dgpu". And if CRAT table is
  broken/missing, we will fall back to treat it as "dgpu" as well.

What do you think of it?

Thanks,
Ray

> 
> Regards,
>   Felix
> 
> 
> > Signed-off-by: Huang Rui <ray.huang at amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdkfd/kfd_crat.c   | 6 ++++++
> >  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 4 ++--
> >  2 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > index 6a250f8fcfb8..66d9f7280fe8 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > @@ -22,6 +22,7 @@
> >  
> >  #include <linux/pci.h>
> >  #include <linux/acpi.h>
> > +#include <asm/processor.h>
> >  #include "kfd_crat.h"
> >  #include "kfd_priv.h"
> >  #include "kfd_topology.h"
> > @@ -781,6 +782,11 @@ int kfd_create_crat_image_acpi(void **crat_image, size_t *size)
> >  		return -ENODATA;
> >  	}
> >  
> > +	/* Raven series will go with the fallback path to use virtual CRAT */
> > +	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> > +	    boot_cpu_data.x86 == 0x17)
> > +		return -ENODATA;
> > +
> >  	pcrat_image = kmemdup(crat_table, crat_table->length, GFP_KERNEL);
> >  	if (!pcrat_image)
> >  		return -ENOMEM;
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > index d5e790f046b4..93179c928371 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > @@ -116,6 +116,7 @@ static const struct kfd_device_info carrizo_device_info = {
> >  	.num_xgmi_sdma_engines = 0,
> >  	.num_sdma_queues_per_engine = 2,
> >  };
> > +#endif
> >  
> >  static const struct kfd_device_info raven_device_info = {
> >  	.asic_family = CHIP_RAVEN,
> > @@ -128,13 +129,12 @@ static const struct kfd_device_info raven_device_info = {
> >  	.num_of_watch_points = 4,
> >  	.mqd_size_aligned = MQD_SIZE_ALIGNED,
> >  	.supports_cwsr = true,
> > -	.needs_iommu_device = true,
> > +	.needs_iommu_device = false,
> >  	.needs_pci_atomics = true,
> >  	.num_sdma_engines = 1,
> >  	.num_xgmi_sdma_engines = 0,
> >  	.num_sdma_queues_per_engine = 2,
> >  };
> > -#endif
> >  
> >  static const struct kfd_device_info hawaii_device_info = {
> >  	.asic_family = CHIP_HAWAII,