[PATCH] drm/amdkfd: force raven as "dgpu" path

Felix Kuehling felix.kuehling at amd.com
Tue Aug 11 13:08:23 UTC 2020


Am 2020-08-11 um 7:45 a.m. schrieb Huang Rui:
> On Fri, Aug 07, 2020 at 11:00:38PM +0800, Kuehling, Felix wrote:
>> Am 2020-08-07 um 4:25 a.m. schrieb Huang Rui:
>>> We still have a few iommu issues which need to address, so force raven
>>> as "dgpu" path for the moment.
>>>
>>> Will enable IOMMUv2 since the issues are fixed.
>> Do you mean "_when_ the issues are fixed"?
> Yes.
>  
>> The current iommuv2 troubles aside, I think this change breaks existing
>> user mode. The existing Thunk is not prepared to see Raven as a dGPU. So
>> this is not something we want to do in an upstream Linux kernel change.
> Do you mean it may break the thunk without setting "is_dgpu" flag in the
> hsa_gfxip_table for raven?

Exactly. Try running your modified KFD against an unchanged Thunk. If it
fails, you cannot make that kernel change.


>
>> I suggest using the ignore_crat module parameter for the workaround
>> instead. You may need to duplicate the raven_device_info and pick the
>> right one depending on whether it is a dGPU or an APU. The only
>> difference would be the need_iommu_device option. If ignore_crat is set,
>> you can support Raven as a dGPU and require a corresponding Thunk change
>> that conditionally support Raven as a dGPU.
>>
>> I think such a change would also be the right direction for supporting
>> Raven more universally in the future. It can be extended to
>> conditionally treat Raven as a dGPU automatically in some situations:
>>
>>   * broken or missing CRAT table
>>   * IOMMUv2 disabled
>>
>> Those are all situations where the current driver is broken anyway (and
>> always has been), so it would not be a kernel change that breaks
>> existing user mode.
> OK, I think I understand it. We should use a parameter/flag such as
> ignore_crat or something else to inform the user mode which path we should
> go, treat it as dgpu or apu. Then thunk can detect the flag from kernel to
> know how to handle the case. Am I right?

User mode should not look at the ignore_crat parameter. It is not part
of the user-kernel API. The kernel parameter is to control kernel
behaviour. The point is, that the default behaviour should not change,
because that would break existing user mode.

A fixed user mode can use existing topology information to figure out
which way to go. If the CUs are in the CPU node, it's an APU. Otherwise
it's a dGPU.


>
>> In addition the Thunk could be changed to downgrade a Raven APU to dGPU
>> (by splitting the APU node into a separate CPU and dGPU node) if other
>> dGPUs are present in the systems to disable all the APU-specific code
>> paths and allow all the GPUs to work together seamlessly with SVM.
> Thanks! Let me look at the thunk again.
>
> For now, based on your comments, I would like to update patch as:
>
> - Modify the ignore_crat parameter as tristate: "use", "ignore", and
>   "auto". Usually, by default is "auto = use", but in some special case
>   such as iommu v2 broken or no iommu v2 support yet (Renoir), we have to
>   set "auto = ignore". Then treat it as "dgpu". And if CRAT table is
>   broken/missing, we will fall back to treat it as "dgpu" as well.
>
> What do you think of it?

If the CRAT is broken/missing or IOMMUv2 is not enabled, "use" is not a
very useful setting. I think you only need two states: "auto" and
"ignore". That's pretty much what we have now. Except we need more
conditions to not use the CRAT in the "auto" case.

Regards,
  Felix


>
> Thanks,
> Ray
>
>> Regards,
>>   Felix
>>
>>
>>> Signed-off-by: Huang Rui <ray.huang at amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdkfd/kfd_crat.c   | 6 ++++++
>>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 4 ++--
>>>  2 files changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>>> index 6a250f8fcfb8..66d9f7280fe8 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>>> @@ -22,6 +22,7 @@
>>>  
>>>  #include <linux/pci.h>
>>>  #include <linux/acpi.h>
>>> +#include <asm/processor.h>
>>>  #include "kfd_crat.h"
>>>  #include "kfd_priv.h"
>>>  #include "kfd_topology.h"
>>> @@ -781,6 +782,11 @@ int kfd_create_crat_image_acpi(void **crat_image, size_t *size)
>>>  		return -ENODATA;
>>>  	}
>>>  
>>> +	/* Raven series will go with the fallback path to use virtual CRAT */
>>> +	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
>>> +	    boot_cpu_data.x86 == 0x17)
>>> +		return -ENODATA;
>>> +
>>>  	pcrat_image = kmemdup(crat_table, crat_table->length, GFP_KERNEL);
>>>  	if (!pcrat_image)
>>>  		return -ENOMEM;
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index d5e790f046b4..93179c928371 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -116,6 +116,7 @@ static const struct kfd_device_info carrizo_device_info = {
>>>  	.num_xgmi_sdma_engines = 0,
>>>  	.num_sdma_queues_per_engine = 2,
>>>  };
>>> +#endif
>>>  
>>>  static const struct kfd_device_info raven_device_info = {
>>>  	.asic_family = CHIP_RAVEN,
>>> @@ -128,13 +129,12 @@ static const struct kfd_device_info raven_device_info = {
>>>  	.num_of_watch_points = 4,
>>>  	.mqd_size_aligned = MQD_SIZE_ALIGNED,
>>>  	.supports_cwsr = true,
>>> -	.needs_iommu_device = true,
>>> +	.needs_iommu_device = false,
>>>  	.needs_pci_atomics = true,
>>>  	.num_sdma_engines = 1,
>>>  	.num_xgmi_sdma_engines = 0,
>>>  	.num_sdma_queues_per_engine = 2,
>>>  };
>>> -#endif
>>>  
>>>  static const struct kfd_device_info hawaii_device_info = {
>>>  	.asic_family = CHIP_HAWAII,


More information about the amd-gfx mailing list