[PATCH v2 5/8] accel/qaic: Add datapath

Wed Mar 1 18:14:35 UTC 2023

On 3/1/2023 10:05 AM, Stanislaw Gruszka wrote:
> On Wed, Mar 01, 2023 at 09:08:03AM -0700, Jeffrey Hugo wrote:
>>> This looks a bit suspicious. Are you sure you can modify
>>> sg->dma_address and still use it as valid value ?
>>
>> A single entry in the sg table is a contiguous mapping of memory.  If it
>> wasn't contiguous, it would have to be broken up into multiple entries.  In
>> the simple case, a driver is going to take the dma_address/len pair and hand
>> that directly to the device.  Then the device is going to access every
>> address in that range.
>>
>> If the device can access every address from dma_address to dma_address +
>> len, why can't it access a subset of that?
> 
> Required address alignment can be broken. Not sure if only that.

AIC100 doesn't have required alignment.  AIC100 can access any 64-bit 
address, at a byte level granularity.  The only restriction AIC100 has 
is that the size of a transfer is restricted to a 32-bit value, so max 
individual transfer size of 4GB.  Transferring more than 4GB requires 
multiple transactions.

>>>> Are you suggesting renaming
>>>> this function?  I guess I'm not quite understanding your comment here. Can
>>>> you elaborate?
>>>
>>> Renaming would be nice. I was thinking by simplifying it, not sure
>>> now if that's easy achievable, though.
>>
>> Ok.  I'll think on this.
> 
> Maybe this function could be removed ? And create sg lists
> that hardware can handle without any modification.
> Just idea to consider, not any requirement.

Ok, so this is part of our "slicing" operation, and thus required.

Maybe how slicing works is not clear.

Lets say that we have a workload on AIC100 that can identify a license 
plate in a picture (aka lprnet).  Lets assume this workload only needs 
the RGB values of a RGBA file (a "jpeg" we are processing).

Userspace allocates a BO to hold the entire file.  A quarter of the file 
is R values, a quarter is G values, etc.  For simplicity, lets assume 
the R values are all sequentially listed, then the G values, then the B 
values, finally the A values.  When we allocate the BO, we map it once. 
  If we have an IOMMU, this optimizes the IOMMU mappings.  BOs can be 
quite large.  We have some test workloads based on real world workloads 
where each BO is 16-32M in size, and there are multiple BOs.  I don't 
want to map a 32M BO N duplicate times in the IOMMU.

So, now userspace slices the BO.  It tells us we need to transfer the 
RGB values (the first 75% of the BO), but not the A values.  So, we 
create a copy of the mapped SG and edit it to represent this transfer, 
which is a subset of the entire BO.  Using the slice information and the 
mapping information, we construct the DMA engine commands that can be 
used to transfer the relevant portions of the BO to the device.

It sounds like you are suggesting, lets flip this around.  Don't map the 
entire BO once.  Instead, wait for the slice info from userspace, 
construct a sg list based on the parts of the BO for the slice, and map 
that.  Then the driver gets a mapped SG it can just use.  The issue I 
see with that is slices can overlap.  You can transfer the same part of 
a BO multiple times.  Maybe lprnet has multiple threads on AIC100 where 
thread A consumes R data, thread B consumes R and G data, and thread C 
consumes B data.  We need to transfer the R data twice to different 
device locations so that threads A and B can consume the R data 
independently.

If we map per slice, we are going to map the R part of the BO twice in 
the IOMMU.  Is that valid?  It feels possible that there exists some 
IOMMU implementation that won't allow multiple IOVAs to map to the same 
DDR PA because that is weird and the implementer thinks its a software 
bug.  I don't want to run into that.  Assuming it is valid, that is 
multiple mappings in the IOMMU TLB which could have been a single 
mapping.  We are wasting IOMMU resources.

There are some ARM systems we support with limited IOVA space in the 
IOMMU, and we've had some issues with exhausting that space.  The 
current implementation is influenced by those experiences.

I appreciate the outsider perspective (here and elsewhere). 
Re-evaluating some of these things is resulting in improvements.

-Jeff