[PATCH V2 01/10] accel/amdxdna: Add a new driver for AMD AI Engine

Wed Aug 14 21:53:49 UTC 2024

On 8/14/2024 2:24 PM, Lizhi Hou wrote:
> 
> On 8/14/24 11:46, Jeffrey Hugo wrote:
>> On 8/14/2024 12:16 PM, Lizhi Hou wrote:
>>>
>>> On 8/9/24 09:11, Jeffrey Hugo wrote:
>>>> On 8/5/2024 11:39 AM, Lizhi Hou wrote:
>>>>> diff --git a/drivers/accel/amdxdna/aie2_pci.c 
>>>>> b/drivers/accel/amdxdna/aie2_pci.c
>>>>> new file mode 100644
>>>>> index 000000000000..3660967c00e6
>>>>> --- /dev/null
>>>>> +++ b/drivers/accel/amdxdna/aie2_pci.c
>>>>> @@ -0,0 +1,182 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>> +/*
>>>>> + * Copyright (C) 2023-2024, Advanced Micro Devices, Inc.
>>>>> + */
>>>>> +
>>>>> +#include <linux/amd-iommu.h>
>>>>> +#include <linux/errno.h>
>>>>> +#include <linux/firmware.h>
>>>>> +#include <linux/iommu.h>
>>>>
>>>> You are clearly missing linux/pci.h and I suspect many more.
>>> Other headers are indirectly included by "aie2_pci.h" underneath.
>>
>> aie2_pci.h also does not directly include linux/pci.h
> 
> it is aie2_pci.h --> amdxdna_pci_drv.h --> linux/pci.h.
> 
> It compiles without any issue.

I did not doubt that it compiled.  To be clear, I'm pointing out what I 
believe is poor style - relying on implicit includes.

There is no reason that amdxdna_pci_drv.h needs to include linux/pci.h 
(at-least in this patch).  That header file doesn't use any of the 
content from pci.h.  If this were merged, I suspect it would be valid 
for someone to post a cleanup patch that removes pci.h from 
amdxdna_pci_drv.h.

The problem comes when someone does a tree wide refactor of some header 
- perhaps moving a function out of pci.h into something else.  If they 
grep the source tree, they'll find amdxdna_pci_drv.h includes pci.h but 
really doesn't use it.  They likely won't see aie2_pci.c which may break 
because of the refactor.  Even more problematic is if pci.h is including 
something that you need, and you are not including it anywhere.  The 
code will still compile, but maybe in the next kernel cycle pci.h no 
longer includes that thing.  Your code will break.

The 4 includes you have here seems entirely too little, and I'm not 
clearly seeing the logic of what gets explicitly included vs what is 
implicitly included.  firmware.h is explicitly included, but pci.h is 
not, yet it seems like you use a lot more from pci.h.

There is the include what you use project that attempts to automate 
this, although I don't know how well it works with kernel code - 
https://github.com/include-what-you-use/include-what-you-use

> 
>>
>>>>> +
>>>>> +    ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
>>>>> +    if (ret) {
>>>>> +        XDNA_ERR(xdna, "Failed to set DMA mask: %d", ret);
>>>>> +        goto release_fw;
>>>>> +    }
>>>>> +
>>>>> +    nvec = pci_msix_vec_count(pdev);
>>>>
>>>> This feels weird.  Can your device advertise variable number of 
>>>> MSI-X vectors?  It only works if all of the vectors are used?
>>> That is possible. the driver supports different hardware. And the fw 
>>> assigns vector for hardware context dynamically. So the driver needs 
>>> to allocate all vectors ahead.
>>
>> So, if the device requests N MSIs, but the host is only able to 
>> satisfy 1 (or some number less than N), the fw is completely unable to 
>> function?
> The fw may return interrupt 2 is assigned to hardware context. Then the 
> driver may not deal with it in this case. I think it is ok to fail if 
> the system has very limited resource.

Ok.  I suspect you'll want to change that behavior in the future with a 
fw update, but if this is how things work today, then this is how the 
driver must be.