[PATCH 15/87] drm/i915: Pack the partial view size and offset into a single u64

Chris Wilson chris at chris-wilson.co.uk
Fri Jan 6 16:00:47 UTC 2017


On Fri, Jan 06, 2017 at 03:27:12PM +0000, Tvrtko Ursulin wrote:
> 
> On 06/01/2017 14:21, Chris Wilson wrote:
> >On Fri, Jan 06, 2017 at 01:50:27PM +0000, Tvrtko Ursulin wrote:
> >>>>I am not sure the complication is worth the small saving. Don't we
> >>>>ever expect to start creating >8MiB partials? If one day we get an
> >>>>efficient way of querying for the biggest hole we could start doing
> >>>>that and then this would have to be undone.
> >>>
> >>>The current code works best with fixed size partials (fenceable chunks)
> >>>as the issue with variable sized partials is the lookup cache efficiency.
> >>
> >>You mean the VMA rb tree? But fewer VMAs and fewer faults (larger
> >>partials) surely must be better. 1MiB chunks were AFAIR just a
> >>starting point.
> >
> >Depends. Bigger objects in the GGTT mean more thrashing, leaving PTEs
> >unused between cycles. Remember we still try the whole object first
> >and only use partials if forced - which typically means we are in the
> >contention scenario where objects are likely to be paged in and out of
> >the GTT.
> 
> It depends on the usage pattern. Perhaps it is just one huge object
> which can't fit and then we do it in 1 MiB chunks while we could do
> it in bigger ones instead. Maybe it doesn't matter for such huge
> objects, in fact maybe it doesn't hugely matter for mmap_gtt at all.

Exactly. Where it will have an impact is if we cause contention
elsewhere via the locks. GTT mapping is nasty because the aperture is
such a small resource (and before gen8, aperture thrashing causes GPU
stalls).

> But then the question is why bother with saving a few bytes on the
> memcpy. And the ggtt view type == sizeof trickery is what brings the
> main saving anyway. (From memcpy on 32 bytes, to memcpy on 12 bytes.
> This patch only goes further from 12 to 8, even though it comes
> before it.)

It was easier to pack the bits than write the BUILD_BUG to prove the
compiler didn't insert padding.

> >Improvements on the fault path should be primarily on shrinking the
> >lock, ideally avoiding struct_mutex, adding a vm lock and fence lock
> >(with more atomic pin counting), to improve concurrency of faulters
> >and avoid stalling others.
> 
> I agree with this as a priority, but wouldn't necessarily preclude
> using the largest hole as the partial size.

I do not think the largest hole is a good idea; that guarantees thrashing.
 
> >>>The packing is to ensure we have no unused bits in the memcmp() inside
> >>>i915_vma_compare.
> >>
> >>Do you think it makes a difference? You were initially reluctant to
> >>let in straight away that patch where I save a function call in the
> >>lookup, AFAIR.
> >
> >Because you were optimising the wrong thing. The primary user of
> >vma_compare is for ggtt lookups, execbuf needs a O(1) path from handle
> >to vma (which only used the vma rbtree for its initial creation).
> 
> Well I saved a function call in a dominant path in the current code
> base. Don't see how that is a wrong thing to optimise.

Hmm, the patch for O(1) is almost 2 years old. It feels like such a
long time ago...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
