[PATCH 13/14] staging: android: ion: Do not sync CPU cache on map/unmap

Tue Jan 22 22:56:00 UTC 2019

On Mon, 21 Jan 2019, Brian Starkey wrote:

> Hi,
> 
> Sorry for being a bit sporadic on this. I was out travelling last week
> with little time for email.
> 
> On Fri, Jan 18, 2019 at 11:16:31AM -0600, Andrew F. Davis wrote:
> > On 1/17/19 7:11 PM, Liam Mark wrote:
> > > On Thu, 17 Jan 2019, Andrew F. Davis wrote:
> > > 
> > >> On 1/16/19 4:54 PM, Liam Mark wrote:
> > >>> On Wed, 16 Jan 2019, Andrew F. Davis wrote:
> > >>>
> > >>>> On 1/16/19 9:19 AM, Brian Starkey wrote:
> > >>>>> Hi :-)
> > >>>>>
> > >>>>> On Tue, Jan 15, 2019 at 12:40:16PM -0600, Andrew F. Davis wrote:
> > >>>>>> On 1/15/19 12:38 PM, Andrew F. Davis wrote:
> > >>>>>>> On 1/15/19 11:45 AM, Liam Mark wrote:
> > >>>>>>>> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
> > >>>>>>>>
> > >>>>>>>>> On 1/14/19 11:13 AM, Liam Mark wrote:
> > >>>>>>>>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Buffers may not be mapped from the CPU so skip cache maintenance here.
> > >>>>>>>>>>> Accesses from the CPU to a cached heap should be bracketed with
> > >>>>>>>>>>> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Signed-off-by: Andrew F. Davis <afd at ti.com>
> > >>>>>>>>>>> ---
> > >>>>>>>>>>>  drivers/staging/android/ion/ion.c | 7 ++++---
> > >>>>>>>>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
> > >>>>>>>>>>>
> > >>>>>>>>>>> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
> > >>>>>>>>>>> index 14e48f6eb734..09cb5a8e2b09 100644
> > >>>>>>>>>>> --- a/drivers/staging/android/ion/ion.c
> > >>>>>>>>>>> +++ b/drivers/staging/android/ion/ion.c
> > >>>>>>>>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
> > >>>>>>>>>>>  
> > >>>>>>>>>>>  	table = a->table;
> > >>>>>>>>>>>  
> > >>>>>>>>>>> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
> > >>>>>>>>>>> -			direction))
> > >>>>>>>>>>> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
> > >>>>>>>>>>> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))
> > >>>>>>>>>>
> > >>>>>>>>>> Unfortunately I don't think you can do this for a couple reasons.
> > >>>>>>>>>> You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
> > >>>>>>>>>> If the calls to {begin,end}_cpu_access were made before the call to 
> > >>>>>>>>>> dma_buf_attach then there won't have been a device attached so the calls 
> > >>>>>>>>>> to {begin,end}_cpu_access won't have done any cache maintenance.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> That should be okay though, if you have no attachments (or all
> > >>>>>>>>> attachments are IO-coherent) then there is no need for cache
> > >>>>>>>>> maintenance. Unless you mean a sequence where a non-io-coherent device
> > >>>>>>>>> is attached later after data has already been written. Does that
> > >>>>>>>>> sequence need supporting? 
> > >>>>>>>>
> > >>>>>>>> Yes, but also I think there are cases where CPU access can happen before 
> > >>>>>>>> in Android, but I will focus on later for now.
> > >>>>>>>>
> > >>>>>>>>> DMA-BUF doesn't have to allocate the backing
> > >>>>>>>>> memory until map_dma_buf() time, and that should only happen after all
> > >>>>>>>>> the devices have attached so it can know where to put the buffer. So we
> > >>>>>>>>> shouldn't expect any CPU access to buffers before all the devices are
> > >>>>>>>>> attached and mapped, right?
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Here is an example where CPU access can happen later in Android.
> > >>>>>>>>
> > >>>>>>>> Camera device records video -> software post processing -> video device 
> > >>>>>>>> (who does compression of raw data) and writes to a file
> > >>>>>>>>
> > >>>>>>>> In this example assume the buffer is cached and the devices are not 
> > >>>>>>>> IO-coherent (quite common).
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> This is the start of the problem, having cached mappings of memory that
> > >>>>>>> is also being accessed non-coherently is going to cause issues one way
> > >>>>>>> or another. On top of the speculative cache fills that have to be
> > >>>>>>> constantly fought back against with CMOs like below; some coherent
> > >>>>>>> interconnects behave badly when you mix coherent and non-coherent access
> > >>>>>>> (snoop filters get messed up).
> > >>>>>>>
> > >>>>>>> The solution is to either always have the addresses marked non-coherent
> > >>>>>>> (like device memory, no-map carveouts), or if you really want to use
> > >>>>>>> regular system memory allocated at runtime, then all cached mappings of
> > >>>>>>> it need to be dropped, even the kernel logical address (area as painful
> > >>>>>>> as that would be).
> > >>>>>
> > >>>>> Ouch :-( I wasn't aware about these potential interconnect issues. How
> > >>>>> "real" is that? It seems that we aren't really hitting that today on
> > >>>>> real devices.
> > >>>>>
> > >>>>
> > >>>> Sadly there is at least one real device like this now (TI AM654). We
> > >>>> spent some time working with the ARM interconnect spec designers to see
> > >>>> if this was allowed behavior, final conclusion was mixing coherent and
> > >>>> non-coherent accesses is never a good idea.. So we have been working to
> > >>>> try to minimize any cases of mixed attributes [0], if a region is
> > >>>> coherent then everyone in the system needs to treat it as such and
> > >>>> vice-versa, even clever CMO that work on other systems wont save you
> > >>>> here. :(
> > >>>>
> > >>>> [0] https://github.com/ARM-software/arm-trusted-firmware/pull/1553
> > >>>>
> 
> "Never a good idea" - but I think it should still be well defined by
> the ARMv8 ARM (Section B2.8). Does this apply to your system?
> 
> "If the mismatched attributes for a memory location all assign the
> same shareability attribute to a Location that has a cacheable
> attribute, any loss of uniprocessor semantics, ordering, or coherency
> within a shareability domain can be avoided by use of software cache
> management"
> 
> https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
> 
> If the cache is invalidated when switching between access types,
> shouldn't the snoop filters get un-messed-up?
> 
> > >>>>
> > >>>>>>>
> > >>>>>>>> ION buffer is allocated.
> > >>>>>>>>
> > >>>>>>>> //Camera device records video
> > >>>>>>>> dma_buf_attach
> > >>>>>>>> dma_map_attachment (buffer needs to be cleaned)
> > >>>>>>>
> > >>>>>>> Why does the buffer need to be cleaned here? I just got through reading
> > >>>>>>> the thread linked by Laura in the other reply. I do like +Brian's
> > >>>>>>
> > >>>>>> Actually +Brian this time :)
> > >>>>>>
> > >>>>>>> suggestion of tracking if the buffer has had CPU access since the last
> > >>>>>>> time and only flushing the cache if it has. As unmapped heaps never get
> > >>>>>>> CPU mapped this would never be the case for unmapped heaps, it solves my
> > >>>>>>> problem.
> > >>>>>>>
> > >>>>>>>> [camera device writes to buffer]
> > >>>>>>>> dma_buf_unmap_attachment (buffer needs to be invalidated)
> > >>>>>>>
> > >>>>>>> It doesn't know there will be any further CPU access, it could get freed
> > >>>>>>> after this for all we know, the invalidate can be saved until the CPU
> > >>>>>>> requests access again.
> > >>>>>
> > >>>>> We don't have any API to allow the invalidate to happen on CPU access
> > >>>>> if all devices already detached. We need a struct device pointer to
> > >>>>> give to the DMA API, otherwise on arm64 there'll be no invalidate.
> > >>>>>
> > >>>>> I had a chat with a few people internally after the previous
> > >>>>> discussion with Liam. One suggestion was to use
> > >>>>> DMA_ATTR_SKIP_CPU_SYNC in unmap_dma_buf, but only if there's at least
> > >>>>> one other device attached (guarantees that we can do an invalidate in
> > >>>>> the future if begin_cpu_access is called). If the last device
> > >>>>> detaches, do a sync then.
> > >>>>>
> > >>>>> Conversely, in map_dma_buf, we would track if there was any CPU access
> > >>>>> and use/skip the sync appropriately.
> > >>>>>
> > >>>>
> > >>>> Now that I think this all through I agree this patch is probably wrong.
> > >>>> The real fix needs to be better handling in the dma_map_sg() to deal
> > >>>> with the case of the memory not being mapped (what I'm dealing with for
> > >>>> unmapped heaps), and for cases when the memory in question is not cached
> > >>>> (Liam's issue I think). For both these cases the dma_map_sg() does the
> > >>>> wrong thing.
> > >>>>
> > >>>>> I did start poking the code to check out how that would look, but then
> > >>>>> Christmas happened and I'm still catching back up.
> > >>>>>
> > >>>>>>>
> > >>>>>>>> dma_buf_detach  (device cannot stay attached because it is being sent down 
> > >>>>>>>> the pipeline and Camera doesn't know the end of the use case)
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> This seems like a broken use-case, I understand the desire to keep
> > >>>>>>> everything as modular as possible and separate the steps, but at this
> > >>>>>>> point no one owns this buffers backing memory, not the CPU or any
> > >>>>>>> device. I would go as far as to say DMA-BUF should be free now to
> > >>>>>>> de-allocate the backing storage if it wants, that way it could get ready
> > >>>>>>> for the next attachment, which may change the required backing memory
> > >>>>>>> completely.
> > >>>>>>>
> > >>>>>>> All devices should attach before the first mapping, and only let go
> > >>>>>>> after the task is complete, otherwise this buffers data needs copied off
> > >>>>>>> to a different location or the CPU needs to take ownership in-between.
> > >>>>>>>
> > >>>>>
> > >>>>> Yeah.. that's certainly the theory. Are there any DMA-BUF
> > >>>>> implementations which actually do that? I hear it quoted a lot,
> > >>>>> because that's what the docs say - but if the reality doesn't match
> > >>>>> it, maybe we should change the docs.
> > >>>>>
> > >>>>
> > >>>> Do you mean on the userspace side? I'm not sure, seems like Android
> > >>>> might be doing this wrong from what I can gather. From kernel side if
> > >>>> you mean the "de-allocate the backing storage", we will have some cases
> > >>>> like this soon, so I want to make sure userspace is not abusing DMA-BUF
> > >>>> in ways not specified in the documentation. Changing the docs to force
> > >>>> the backing memory to always be allocated breaks the central goal in
> > >>>> having attach/map in DMA-BUF separate.
> 
> Actually I meant in the kernel, in exporters. I haven't seen anyone
> using the API as it was intended (defer allocation until first map,
> migrate between different attachments, etc.). Mostly, backing storage
> seems to get allocated at the point of export, and device mappings are
> often held persistently (e.g. the DRM prime code maps the buffer at
> import time, and keeps it mapped: drm_gem_prime_import_dev).
> 
> I wasn't aware that CPU access before first device access was
> considered an abuse of the API - it seems like a valid thing to want
> to do.
> 
> > >>>>
> > >>>>>>>> //buffer is send down the pipeline
> > >>>>>>>>
> > >>>>>>>> // Usersapce software post processing occurs
> > >>>>>>>> mmap buffer
> > >>>>>>>
> > >>>>>>> Perhaps the invalidate should happen here in mmap.
> > >>>>>>>
> > >>>>>>>> DMA_BUF_IOCTL_SYNC IOCT with flags DMA_BUF_SYNC_START // No CMO since no 
> > >>>>>>>> devices attached to buffer
> > >>>>>>>
> > >>>>>>> And that should be okay, mmap does the sync, and if no devices are
> > >>>>>>> attached nothing could have changed the underlying memory in the
> > >>>>>>> mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.
> > >>>>>
> > >>>>> Yeah, that's true - so long as you did an invalidate in unmap_dma_buf.
> > >>>>> Liam was saying that it's too painful for them to do that every time a
> > >>>>> device unmaps - when in many cases (device->device, no CPU) it's not
> > >>>>> needed.
> > >>>>
> > >>>> Invalidates are painless, at least compared to a real cache flush, just
> > >>>> set the invalid bit vs actually writing out lines. I thought the issue
> > >>>> was on the map side.
> > >>>>
> > >>>
> > >>> Invalidates aren't painless for us because we have a coherent system cache 
> > >>> so clean lines get written out.
> > >>
> > >> That seems very broken, why would clean lines ever need to be written
> > >> out, that defeats the whole point of having the invalidate separate from
> > >> clean. How do you deal with stale cache lines? I guess in your case this
> > >> is what forces you to have to use uncached memory for DMA-able memory.
> > >>
> > > 
> > > My understanding is that our ARM invalidate is a clean + invalidate, I had 
> > > concerns about the clean lines being written to the the system cache as 
> > > part of the 'clean', but the following 'invalidate' would take care of 
> > > actually invalidating the lines (so nothign broken).
> > > But i am probably wrong on this and it is probably smart enough not to the 
> > > writing of the clean lines.
> > > 
> > 
> > You are correct that for a lot of ARM cores "invalidate" is always a
> > "clean + invalidate". At first I thought this was kinda silly as there
> > is now no way to mark a dirty line invalid without it getting written
> > out first, but if you think about it any dirty cache-line can be written
> > out (cleaned) at anytime anyway, so this doesn't actually change system
> > behavior. You should just not write to memory (make the line dirty)
> > anything you don't want eventually written out.
> > 
> > Point two, it's not just smart enough to not write-out clean lines, it
> > is guaranteed not to write them out by the spec. Otherwise since
> > cache-lines can be randomly filled if those same clean lines got written
> > out on invalidate operations there would be no way to maintain coherency
> > and things would be written over top each other all over the place.
> > 
> > > But regardless, targets supporting a coherent system cache is a legitamate 
> > > configuration and an invalidate on this configuration does have to go to 
> > > the bus to invalidate the system cache (which isn't free) so I dont' think
> > > you can make the assumption that invalidates are cheap so that it is okay 
> > > to do them (even if they are not needed) on every dma unmap.
> > > 
> > 
> > Very true, CMOs need to be broadcast to other coherent masters on a
> > coherent interconnect (and the interconnect itself if it has a cache as
> > well (L3)), so not 100% free, but almost, just the infinitesimal cost of
> > the cache tag check in hardware. If there are no non-coherent devices
> > attached then the CMOs are no-ops, if there are then the data needs to
> > be written out either way, doing it every access like is done with
> > uncached memory (- any write combining) will blow away any saving made
> > from the one less CMO. Either way you lose with uncached mappings of
> > memory. If I'm wrong I would love to know.
> > 
> 
> From what I understand, the current DMA APIs are not equipped to
> handle having coherent and non-coherent devices attached at the same
> time. The buffer is either in "CPU land" or "Device land", there's no
> smaller granule of "Coherent Device land" or "Non-Coherent Device
> land".
> 
> I think if there's devices which are making coherent accesses, and
> devices which are making non-coherent accesses, then we can't support
> them being attached at the same time without some enhancements to the
> APIs.
> 
> > >>> And these invalidates can occur on fairly large buffers.
> > >>>
> > >>> That is why we haven't went with using cached ION memory and "tracking CPU 
> > >>> access" because it only solves half the problem, ie there isn't a way to 
> > >>> safely skip the invalidate (because we can't read the future).
> > >>> Our solution was to go with uncached ION memory (when possible), but as 
> > >>> you can see in other discussions upstream support for uncached memory has
> > >>> its own issues.
> > >>>
> 
> @Liam, in your problematic use-cases, are both devices detached when
> the buffer moves between them?
> 
>  1) dev 1 map, access, unmap
>  2) dev 1 detach
>  3) (maybe) CPU access
>  4) dev 2 attach
>  5) dev 2 map, access
> 
> I still think a pretty pragmatic solution is to use
> DMA_ATTR_SKIP_CPU_SYNC until the last device detaches. That won't work
> if your access sequence looks like above...
> 

Yes the piplelining case in Android looks like the above.

> ...however, if your sequence looks like above, then you probably need
> to keep at least one of the devices attached anyway. 

It would be nice if a device could be kept attached, but that doesn't 
always work very well for an pipelining case as there isn't necessarily a 
good place for someone to know when to detach it.

> Otherwise, per
> the API, the buffer could get migrated after 2)/before 5). That will
> surely hurt a lot more than an invalidate.
> 

You are right, in theory it could but in practice it isn't getting 
migrated.
I understand the theoretical case, but it doesn't sounds clean as well to 
force clients to stay attached when there isn't a very nice way of knowing 
when to have them detach.

> > >>
> > >> Sounds like you need to fix upstream support then, finding a way to drop
> > >> all cacheable mappings of memory you want to make uncached mappings for
> > >> seems to be the only solution.
> > >>
> > > 
> > > I think we can probably agree that there woudln't be a good way to remove 
> > > cached mappings without causing an unacceptable performance degradation 
> > > since it would fragment all the nice 1GB kernel mappings we have.
> > > 
> > > So I am trying to find an alternative solution.
> > > 
> > 
> > I'm not sure there is a better solution. How hard is this solution to
> > implement anyway? The kernel already has to make gaps and cut up that
> > nice 1GB mapping when you make a reserved memory space in the lowmem
> > area, so all the logic is probably already implemented. Just need to
> > allow it to be hooked into from Ion when doing doing the uncached mappings.
> > 
> 
> I haven't looked recently, but I'm not sure the early memblock code
> can be reused as-is at runtime. I seem to remember it makes a bunch of
> assumptions about the fact that it's running "early".
> 
> If CPU uncached mappings of normal system memory is really the way
> forward, I could envisage a heap which maintains a pool of chunks of
> memory which it removed from the kernel mapping. The pool could grow
> (remove more pages from the kernel mapping)/shrink (add them back to
> the kernel mapping) as needed.
> 
> John Reitan implemented a compound-page heap, which used compaction to
> get a pool of 2MB contiguous pages. Something like that would at least
> prevent needing full 4kB granularity when removing things from the
> kernel mapping.
> 
> Even better, could it somehow be restricted to a region which is
> already fragmented? (e.g. the one which was used for the default CMA
> heap)
> 
> Thanks,
> -Brian
> 
> > >>>>>
> > >>>>>>>
> > >>>>>>>> [CPU reads/writes to the buffer]
> > >>>>>>>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no 
> > >>>>>>>> devices attached to buffer
> > >>>>>>>> munmap buffer
> > >>>>>>>>
> > >>>>>>>> //buffer is send down the pipeline
> > >>>>>>>> // Buffer is send to video device (who does compression of raw data) and 
> > >>>>>>>> writes to a file
> > >>>>>>>> dma_buf_attach
> > >>>>>>>> dma_map_attachment (buffer needs to be cleaned)
> > >>>>>>>> [video device writes to buffer]
> > >>>>>>>> dma_buf_unmap_attachment 
> > >>>>>>>> dma_buf_detach  (device cannot stay attached because it is being sent down 
> > >>>>>>>> the pipeline and Video doesn't know the end of the use case)
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>>> Also ION no longer provides DMA ready memory, so if you are not doing CPU 
> > >>>>>>>>>> access then there is no requirement (that I am aware of) for you to call 
> > >>>>>>>>>> {begin,end}_cpu_access before passing the buffer to the device and if this 
> > >>>>>>>>>> buffer is cached and your device is not IO-coherent then the cache maintenance
> > >>>>>>>>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> If I am not doing any CPU access then why do I need CPU cache
> > >>>>>>>>> maintenance on the buffer?
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Because ION no longer provides DMA ready memory.
> > >>>>>>>> Take the above example.
> > >>>>>>>>
> > >>>>>>>> ION allocates memory from buddy allocator and requests zeroing.
> > >>>>>>>> Zeros are written to the cache.
> > >>>>>>>>
> > >>>>>>>> You pass the buffer to the camera device which is not IO-coherent.
> > >>>>>>>> The camera devices writes directly to the buffer in DDR.
> > >>>>>>>> Since you didn't clean the buffer a dirty cache line (one of the zeros) is 
> > >>>>>>>> evicted from the cache, this zero overwrites data the camera device has 
> > >>>>>>>> written which corrupts your data.
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> The zeroing *is* a CPU access, therefor it should handle the needed CMO
> > >>>>>>> for CPU access at the time of zeroing.
> > >>>>>>>
> > >>>>>
> > >>>>> Actually that should be at the point of the first non-coherent device
> > >>>>> mapping the buffer right? No point in doing CMO if the future accesses
> > >>>>> are coherent.
> > >>>>
> > >>>> I see your point, as long as the zeroing is guaranteed to be the first
> > >>>> access to this buffer then it should be safe.
> > >>>>
> > >>>> Andrew
> > >>>>
> > >>>>>
> > >>>>> Cheers,
> > >>>>> -Brian
> > >>>>>
> > >>>>>>> Andrew
> > >>>>>>>
> > >>>>>>>> Liam
> > >>>>>>>>
> > >>>>>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> > >>>>>>>> a Linux Foundation Collaborative Project
> > >>>>>>>>
> > >>>>
> > >>>
> > >>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> > >>> a Linux Foundation Collaborative Project
> > >>>
> > >>
> > > 
> > > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> > > a Linux Foundation Collaborative Project
> > > 
> 

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project