[PATCH] cleanup: Add 'struct dev' in the TTM layer to be passed in for DMA API calls.

Wed Mar 23 07:52:46 PDT 2011

On Wed, Mar 23, 2011 at 02:17:18PM +0100, Thomas Hellstrom wrote:
> On 03/23/2011 01:51 PM, Konrad Rzeszutek Wilk wrote:
> >>>I was thinking about this a bit after I found that the PowerPC requires
> >>>the 'struct dev'. But I got a question first, what do you with pages
> >>>that were allocated to a device that can do 64-bit DMA and then
> >>>move it to a device than can 32-bit DMA? Obviously the 32-bit card would
> >>>set the TTM_PAGE_FLAG_DMA32 flag, but the 64-bit would not. What is the
> >>>process then? Allocate a new page from the 32-bit device and then copy over the
> >>>page from the 64-bit TTM and put the 64-bit TTM page?
> >>Yes, in certain situations we need to copy, and if it's necessary in
> >>some cases to use coherent memory with a struct device assoicated
> >>with it, I agree it may be reasonable to do a copy in that case as
> >>well. I'm against, however, to make that the default case when
> >>running on bare metal.
> >This situation could occur on native/baremetal. When you say 'default
> >case' you mean for every type of page without consulting whether it
> >had the TTM_PAGE_FLAG_DMA32?
> 
> No, Basically I mean a device that runs perfectly fine with
> alloc_pages(DMA32) on bare metal shouldn't need to be using
> dma_alloc_coherent() on bare metal, because that would mean we'd need
> to take the copy path above.

I think we got the scenarios confused (or I did at least).
The scenario I used ("I was thinking.."), the 64-bit device would do
alloc_page(GFP_HIGHUSER) and if you were to move it to a 32-bit device
it would have to make a copy of the page as it could not reach the page
from GFP_HIGUSER.

The other scenario, which I think is what you are using, is that
we have a 32-bit device allocating a page, so TTM_PAGE_FLAG_DMA32 is set
and then we if we were to move it a 64-bit device it would need to
copied. But I don't think that is the case - the page would be
reachable by the 64-bit device. Help me out please if I am misunderstanding this.

> 
> >>However, I've looked a bit deeper into all this, and it looks like
> >>we already have other problems that need to be addressed, and that
> >>exists with the code already in git:
> >>
> >>Consider a situation where you allocate a cached DMA32 page from the
> >>ttm page allocator. You'll end up with a coherent page. Then you
> >>make it uncached and finally you return it to the ttm page
> >>allocator. Since it's uncached, it will not be freed by the dma api,
> >>but kept in the uncached pool, and later the incorrect page free
> >>function will be called.
> >Let me look in details in the code, but I thought it would check the
> >TTM_PAGE_FLAG_DMA32 and direct the page to the correct pool?
> >
> >We could piggyback on the idea of the struct I had and have these
> >values:
> >
> >struct ttm_page {
> >	struct page *page;
> >	dma_addr_t	*bus_addr;
> >	struct *ttm_pool	*origin;
> >}
> >
> >And the origin would point to the correct pool so that on de-allocate
> >it would divert it to the original one. Naturally there would
> >be some thinking to be done on the de-alloc path so that
> >the *origin isn't pointing to something htat has already been free-d.
> 
> The idea with these pools is to keep a cache of write-combined
> pages, so the pool
> is determined by the caching status of the page when it is returned
> to the page
> allocator.
> If we were to associate a page with a device, we'd basically need
> one such pool per
> device.

OK, I think I get it. The issue here is that with the struct I mentioned
above is that a page that was initially allocated as coherent, and then
you make it uncached, and if you were to free it - it would go back
to the pool that only deals with 'coherent' - which is wrong.

I had an earlier suggestion which was:

struct ttm_page {
	struct page *page;
	dma_addr_t	*bus_addr;
	struct *dev *dev;
}

this way even if it was to become uncached, you could still have it
associated with the right device. Granted at point the page is not
coherent anymore, it is uncached.

The problem is that on free-ing it is using the 'dma_free_coherent'
on a non-coherent page which it should not do. Would it be possible
then to make it coherent when we free it? Say as part of freeing
everything it would check what the original pool it belonged to
(so back to the struct with 'struct *ttm_pool *origin) and 
convert it back to coherent as part of "moving" it to the original
pool?

> >}
> 
> 
> >>I think we might need to take a few steps back and rethink this whole idea:
> >>
> >>1) How does this work in the AGP case? Let's say you allocate
> >>write-combined DMA32 pages from the ttm page pool (in this case you
> >>won't get coherent memory) and then use them in an AGP gart? Why is
> >>it that we don't need coherent pages then in the Xen case?
> >Hehe.. So I had posted a set of patches to carry the 'dma_addr_t' through
> >the AGP API and then to its backends to program that. And also the frontends
> >(so DRM, TTM) Here is the
> >patchset I posted some time ago:
> >
> >http://linux.derkeiler.com/Mailing-Lists/Kernel/2010-12/msg02382.html
> >and the discussion:
> >
> >http://linux.derkeiler.com/Mailing-Lists/Kernel/2010-12/msg02411.html
> >
> >Dave recommended I skip AGP and just concentrate on PCIe since not to many
> >folks use AGP anymore. Thought I realized that server boards use PCI
> >cards (ATI ES1000), which do utilize the AGP API. So there is breakage there
> >and I have a set of patches for this that I was in process of rebasing
> >on 2.6.39-rcX.
> 
> I see. Well, both in the AGP case and sometimes in the PCIe case,
> (some devices can use a non-cache-coherent PCIe mode for speed)
> it's a not too uncommon use case of TTM to alloc cached pages and
> transition them to write-combined (see impact below).
> 
> >>2) http://www.mjmwired.net/kernel/Documentation/DMA-API.txt, line 33
> >>makes me scared.
> >>We should identify what platforms may have problems with this.
> >Right.. I think nobody much thought about this in context of TTM since
> >that was only used on X86. I can take a look at the DMA API's of the
> >other two major platforms: IA64 and PPC and see what lurks there.
> >
> >>3) When hacking on the unichrome DMA engine it wasn't that hard to
> >>use the synchronization functions of the DMA api correctly:
> >>
> >>  When binding a TTM, the backend calls dma_map_page() on pages, When
> >>unbinding, the backend calls dma_unmap_page(), If we need cpu access
> >>when bound, we need to call dma_sync_single_for_[cpu|device]. If
> >>this is done, it will be harder to implement user-space
> >>sub-allocation, but possible. There will be a performance loss on
> >>some platforms, though.
> >Yup. That was my other suggestion about this. But I had no idea
> >where to sprinkle those 'dma_sync_single_[*]' calls, as they would
> >have been done in the drivers. Probably on its DMA paths, right before
> >telling the GPU to process the CP, and when receiving an interrupt
> >when the CP has been completed.
> 
> In this case dma_sync_single_for_device() would be needed for a bo
> when it was validated for a
> PCI dma engine. At the same time, all user-space virtual mappings of
> the buffer would need to be killed.

We need to kill the user-space virtual mappings b/c they might contain
the wrong caching state? And since they are an alias of the "synced"
page, the "synced" page might be uncached, while the alias is "write-combined".

Ugh. But I thought the WB or WC is done on the pages, not on the
mappings. So it would use the MTRR or PAT, which deal with physical addresses.
So it would not matter what the alias thought it was?

> A dma_sync_single_for_cpu() would be needed as part of making a bo idle.
> 
> 
> /Thomas