Regression introduced by 90797e6d1ec0dfde6ba62a48b9ee3803887d6ed4 ("drm/i915: create compact dma scatter lists for gem objects") Was: Re: i915 mapping large (3MB) scatter list, hitting limits on certain IOMMUs that can only map contiguous regions up to 2MB.

Konrad Rzeszutek Wilk konrad.wilk at oracle.com
Mon Jun 24 08:04:28 PDT 2013


On Sat, Jun 22, 2013 at 03:22:59PM +0100, Chris Wilson wrote:
> On Fri, Jun 21, 2013 at 10:03:43PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Fri, Jun 21, 2013 at 03:28:28PM -0400, Konrad Rzeszutek Wilk wrote:
> > > Hey,
> > 
> > CC-ing Imre,
> > 
> > Imre, your patch 90797e6d1ec0dfde6ba62a48b9ee3803887d6ed4
> > ("drm/i915: create compact dma scatter lists for gem objects") is the cause
> > of the regression.
> > 
> > If I revert your patch it boots fine without any trouble.
> > 
> > I am not entirely sure why that is - as I added some debug code in
> > lib/swiotlb.c to trigger when it can't find 3MB  area (which
> > is what I thought initially was the issue) - but none of the debug
> > code seems to be hit.
> > 
> > Any thoughts?
> 
> You should be hitting drivers/iommu/intel-iommu.c for the dma
> translation. It looks like the contiguous 3MiB segment will be special
> as it is the first sg that __domain_mapping() will attempt to allocate a
> superpage (2MiB) for. What goes wrong at point, I am not sure, but I
> would suggest peppering intel-iommu.c with printk to track down the error.

I figured it out. The issue is that I am backed by the SWIOTLB, which
can only allocate up to 128*4K chunks of contiguous bounce buffer
(see IO_TLB_SEGSIZE) - which means it can only set up buffers of at most 512kB.
Meanwhile one of the compacted SG entries asks it to map a segment well past
that size (3MB).
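
Purely to illustrate the mismatch, here is a minimal standalone sketch. The
constants simply mirror the figures described above (IO_TLB_SEGSIZE chunks of
4K each); they are not the real definitions from lib/swiotlb.c:

/*
 * Minimal sketch of the size mismatch described above; the constants
 * mirror the description in this mail, not the actual swiotlb code.
 */
#include <stdio.h>

#define SKETCH_PAGE_SIZE      4096UL
#define SKETCH_IO_TLB_SEGSIZE 128UL	/* max contiguous bounce chunks per mapping */

int main(void)
{
	unsigned long max_bounce = SKETCH_IO_TLB_SEGSIZE * SKETCH_PAGE_SIZE;
	unsigned long sg_segment = 3UL << 20;	/* the 3MB compacted sg entry */

	printf("swiotlb bounce limit per mapping: %lu kB\n", max_bounce >> 10);
	printf("compacted sg segment:             %lu kB\n", sg_segment >> 10);
	printf("mapping %s\n", sg_segment > max_bounce ? "fails" : "fits");
	return 0;
}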

The change Imre introduced assumes that the CPU addresses (virtual) are
the same as the bus addresses. That is correct on most platforms, but on some
(for example when booting a Xen PV guest with the i915 as PCI passthrough)
the virt_to_phys() values != bus addresses. Which means that the nice check of:

                if (!i || page_to_pfn(page) != last_pfn + 1) {
                        if (i)
                                sg = sg_next(sg);
                        st->nents++;
                        sg_set_page(sg, page, PAGE_SIZE, 0);
                } else {
                        sg->length += PAGE_SIZE;
                }

is too simplistic. What it ought to do is consult the DMA API on whether the
next PFN (page_to_pfn(page)) is really contiguous in DMA (bus) space - and also
that the resulting segment does not go past the DMA mask for said device.

Unfortunately such calls do not exist. Those checks are all done when
dma_map_page_* is called (which is where it failed for me).
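
For illustration only, this is roughly the question the coalescing loop would
have to ask if such helpers existed. Note that dma_pfn_is_contiguous() and
dma_addr_fits_mask() below are hypothetical names - as said above, no such
calls exist in the DMA API today:

/*
 * HYPOTHETICAL sketch - the two dma_* helpers below do not exist in the
 * kernel's DMA API; they only illustrate the check the sg-compacting
 * loop would need before merging a page into the previous segment.
 */
#include <linux/mm.h>
#include <linux/device.h>

static bool can_coalesce(struct device *dev, struct page *page,
			 unsigned long last_pfn, size_t cur_seg_len)
{
	unsigned long pfn = page_to_pfn(page);

	/* CPU-physical contiguity alone is not enough ... */
	if (pfn != last_pfn + 1)
		return false;

	/* ... the pages must also be contiguous as seen by the device ... */
	if (!dma_pfn_is_contiguous(dev, last_pfn, pfn))		/* hypothetical */
		return false;

	/* ... and the grown segment must still respect the device's DMA mask. */
	if (!dma_addr_fits_mask(dev, pfn, cur_seg_len + PAGE_SIZE))	/* hypothetical */
		return false;

	return true;
}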

The best fix I came up with (for the -rc7 stage) is to revert to a bit of the
old per-page behavior when SWIOTLB is in use, while still retaining the compact
sg behavior otherwise. See the following patch:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 970ad17..9edd2eb 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1801,7 +1801,12 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
 			gfp |= __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD;
 			gfp &= ~(__GFP_IO | __GFP_WAIT);
 		}
-
+		if (swiotlb_nr_tbl()) {
+			st->nents++;
+			sg_set_page(sg, page, PAGE_SIZE, 0);
+			sg = sg_next(sg);
+			continue;
+		}
 		if (!i || page_to_pfn(page) != last_pfn + 1) {
 			if (i)
 				sg = sg_next(sg);
@@ -1813,7 +1818,8 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
 		last_pfn = page_to_pfn(page);
 	}
 
-	sg_mark_end(sg);
+	if (!swiotlb_nr_tbl())
+		sg_mark_end(sg);
 	obj->pages = st;
 
 	if (i915_gem_object_needs_bit17_swizzle(obj))
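
As a quick sanity check of the patched path (illustrative only, not part of
the patch, and i915_check_sg_lengths() is just a made-up name for the sketch),
one could walk the resulting table and confirm that with SWIOTLB active every
segment stays a single page, i.e. well under the bounce-buffer limit:

/*
 * Illustrative debug helper, not part of the patch: with swiotlb active
 * the patched loop emits one PAGE_SIZE entry per page, so no segment
 * can exceed what the bounce buffer is able to map.
 */
#include <linux/scatterlist.h>
#include <linux/swiotlb.h>
#include <linux/bug.h>

static void i915_check_sg_lengths(struct sg_table *st)
{
	struct scatterlist *sg;
	int i;

	if (!swiotlb_nr_tbl())
		return;

	for_each_sg(st->sgl, sg, st->nents, i)
		WARN_ON(sg->length != PAGE_SIZE);
}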
