[Intel-gfx] [PATCH 43/46] drm/i915: Allocate a status page for each timeline

Tue Jan 15 18:43:39 UTC 2019

Quoting John Harrison (2019-01-15 18:17:21)
> On 1/15/2019 01:50, Chris Wilson wrote:
> > Quoting John Harrison (2019-01-15 00:56:13)
> >> On 1/7/2019 03:55, Chris Wilson wrote:
> >>> +static int alloc_hwsp(struct i915_timeline *timeline)
> >>> +{
> >>> +     struct drm_i915_private *i915 = timeline->i915;
> >>> +     struct i915_vma *vma;
> >>> +     int offset;
> >>> +
> >>> +     mutex_lock(&i915->gt.timeline_lock);
> >>> +
> >>> +restart:
> >>> +     offset = find_first_cacheline(i915);
> >>> +     if (offset == NBITS && i915->gt.timeline_hwsp) {
> >>> +             i915_vma_put(i915->gt.timeline_hwsp);
> >>> +             i915->gt.timeline_hwsp = NULL;
> >>> +     }
> >>> +
> >>> +     vma = i915->gt.timeline_hwsp;
> >>> +     if (!vma) {
> >>> +             struct drm_i915_gem_object *bo;
> >>> +
> >>> +             /* Drop the lock before allocations */
> >>> +             mutex_unlock(&i915->gt.timeline_lock);
> >>> +
> >>> +             BUILD_BUG_ON(NBITS * CACHELINE_BYTES > PAGE_SIZE);
> >>> +             bo = i915_gem_object_create_internal(i915, PAGE_SIZE);
> >>> +             if (IS_ERR(bo))
> >>> +                     return PTR_ERR(bo);
> >>> +
> >>> +             i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
> >>> +
> >>> +             vma = i915_vma_instance(bo, &i915->ggtt.vm, NULL);
> >>> +             if (IS_ERR(vma)) {
> >>> +                     i915_gem_object_put(bo); /* don't leak bo on error */
> >>> +                     return PTR_ERR(vma);
> >>> +             }
> >>> +
> >>> +             mutex_lock(&i915->gt.timeline_lock);
> >>> +             if (i915->gt.timeline_hwsp) {
> >>> +                     i915_gem_object_put(bo);
> >>> +                     goto restart;
> >>> +             }
> >>> +
> >>> +             i915->gt.timeline_hwsp = vma;
> >>> +             i915->gt.timeline_free = ~0ull;
> >>> +             offset = 0;
> >>> +     }
> >>> +
> >>> +     i915->gt.timeline_free &= ~BIT_ULL(offset);
> >>> +
> >>> +     timeline->hwsp_ggtt = i915_vma_get(vma);
> >>> +     timeline->hwsp_offset = offset * CACHELINE_BYTES;
> >>> +
> >>> +     mutex_unlock(&i915->gt.timeline_lock);
> >>> +
> >>> +     return 0;
> >>> +}
> >> If I'm reading this correctly then gt.timeline_hwsp/free is a cached
> >> copy of the most recently allocated but not yet filled bank of seqno
> >> locations. When it gets full, the i915->gt reference gets dropped and a
> >> new page is allocated and used up line by line. Meanwhile, each timeline
> >> has its own private reference to the page so dropping the i915->gt
> >> reference is safe. And once the last timeline using a given page is
> >> freed, the last reference to that page will be dropped and so the page
> >> itself will also be freed. If a timeline is freed before the currently
> >> cached page is filled, then that timeline's slot will be released and
> >> re-used by the next timeline to be created.
> >>
> >> But what about the scenario of a long running system with a small but
> >> growing number of persistent tasks interspersed with many short lived
> >> tasks? In that case, you would end up with many sparsely populated pages
> >> whose free slots will not get re-used. You could have a linked list
> >> of cached pages. When a page is filled, move it to a 'full' list. When a
> >> timeline is freed, if its page was on the 'full' list, clear the slot
> >> and move it back to the 'available' list.
> > Yes. My thinking was a plain slab cache was a quick-and-dirty
> > improvement over a page-per-timeline. And a freelist would be the next
> > step.
> >
> >> Or is the idea that a worst case of a single page vma allocation per
> >> timeline is the least of our worries if there is an ever growing number
> >> of timelines/contexts/users in the system?
> > Nah, it was just an attempt to quickly reduce the number of allocations,
> > where the worst case of one page+vma per timeline was the starting
> > point.
> >
> > We should break this patch down into 1) one-page-per-timeline, 2) slab
> > cache, 3) free list, 4) profit.
> >
> > At other times we have been wanting to be able to suballocate pages;
> > something to keep in mind would be extending this to arbitrary cacheline
> > allocations.
> The multi-stage approach sounds good. Keep things simple in this patch 
> and then improve the situation later. One thing to be careful of with a 
> cacheline allocator would be to make sure whatever is being converted 
> wasn't using full pages for security reasons. I.e. a page can be private 
> to a process, a cacheline will be shared by many. I guess that would 
> only really apply to allocations being passed to user land as the kernel 
> is considered secure? Or can a user batch buffer write to arbitrary 
> locations within the ppHWSP and thereby splat someone else's seqno?

For the ppHWSP, yes. But internal allocations, accessible only via the
ring + GGTT, should be no problem. I agree that we definitely don't want
to expose subpage sharing across the userspace boundary (all isolation
controls operate only on pages and above).

If userspace wants suballocations, it can (and does) do them for itself
and should regulate its own sharing.
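
As a rough illustration of the free-list step discussed above (a page is
handed out cacheline by cacheline, drops off the 'available' list when
full, and returns to it when a slot is released), here is a minimal
user-space C model. It is only a sketch under stated assumptions: every
name in it (hwsp_page, hwsp_pool, pool_get, pool_put) is hypothetical,
calloc() stands in for the page + vma allocation, there is no locking,
and it is not the i915 implementation.

```c
/* User-space model only; all names are hypothetical, not i915 API. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOTS_PER_PAGE 64 /* assumes a 4096-byte page of 64-byte cachelines */

struct hwsp_page {
	uint64_t free;          /* bit n set => cacheline n is free */
	struct hwsp_page *next; /* link on the 'available' list */
	int listed;             /* nonzero while on the 'available' list */
};

struct hwsp_pool {
	struct hwsp_page *available; /* pages with at least one free slot */
};

/* Index of the lowest set bit; mask must be nonzero. */
static int first_free(uint64_t mask)
{
	int n = 0;

	assert(mask);
	while (!(mask & 1)) {
		mask >>= 1;
		n++;
	}
	return n;
}

/* Take one cacheline; returns 0 and fills page/slot, or -1 on OOM. */
static int pool_get(struct hwsp_pool *pool, struct hwsp_page **page, int *slot)
{
	struct hwsp_page *p = pool->available;

	if (!p) {
		/* No page with room: allocate a fresh, fully free one. */
		p = calloc(1, sizeof(*p));
		if (!p)
			return -1;
		p->free = ~UINT64_C(0);
		p->listed = 1;
		pool->available = p; /* p->next is NULL from calloc */
	}

	*slot = first_free(p->free);
	p->free &= ~(UINT64_C(1) << *slot);
	*page = p;

	if (!p->free) {
		/* Page is now full: unlink it (it is the list head). */
		pool->available = p->next;
		p->next = NULL;
		p->listed = 0;
	}
	return 0;
}

/* Return a cacheline; a previously full page is re-listed for reuse. */
static void pool_put(struct hwsp_pool *pool, struct hwsp_page *p, int slot)
{
	assert(slot >= 0 && slot < SLOTS_PER_PAGE);
	p->free |= UINT64_C(1) << slot;
	if (!p->listed) {
		p->next = pool->available;
		pool->available = p;
		p->listed = 1;
	}
}
```

In this model full pages are not kept on a separate 'full' list; they
are simply unlinked and found again through the pointer each user
holds. The model also never frees a page once every slot has been
returned, which a real allocator would want to do.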

> >>> +     if (global_hwsp) {
> >>> +             timeline->hwsp_ggtt = i915_vma_get(global_hwsp);
> >>> +             timeline->hwsp_offset = I915_GEM_HWS_SEQNO_ADDR;
> >>> +     } else {
> >>> +             err = alloc_hwsp(timeline);
> >>> +             if (err)
> >>> +                     return err;
> >>> +     }
> >>> +
> >>> +     vaddr = i915_gem_object_pin_map(timeline->hwsp_ggtt->obj, I915_MAP_WB);
> >>> +     if (IS_ERR(vaddr)) { /* leak the cacheline, but will clean up later */
> >> Can you explain this comment more? Where/when is the later?
> > On failure here, the cacheline is still marked as allocated in the slab,
> > but the reference to the page is released. So the backing page will be
> > released when everyone else finally drops their reference.
> >
> > Just laziness, since we have the ability to return the cacheline later
> > on...
> Meaning the actual leak is the bit in 'i915->gt.timeline_free' that says 
> this cacheline can or can't be used for the next allocation? Presumably 
> you could do the bit map munging in the case that 'global_hwsp' is null, 
> but the code would certainly be messier for not a lot of gain.

Having had it pointed out that I was being lazy, a bit of refactoring
later showed just how lazy I was.
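
For illustration only, the "leak" in question is the bit in the free
mask that stays cleared when the mapping step fails. A minimal
user-space C sketch of an error path that hands the cacheline back
might look like the following. The names (slab, slab_get, slab_put,
timeline_init_model) and the map_ok flag standing in for the result of
the pin/map step are assumptions made for this sketch; it does not
claim to show what the refactored patch actually does.

```c
/* User-space model only; every name here is hypothetical. */
#include <assert.h>
#include <stdint.h>

#define NSLOTS 64

struct slab {
	uint64_t free; /* bit n set => cacheline n is free */
};

/* Claim the lowest free cacheline, or return -1 if the page is full. */
static int slab_get(struct slab *s)
{
	for (int n = 0; n < NSLOTS; n++) {
		if (s->free & (UINT64_C(1) << n)) {
			s->free &= ~(UINT64_C(1) << n);
			return n;
		}
	}
	return -1;
}

/* Mark a cacheline as free again. */
static void slab_put(struct slab *s, int slot)
{
	assert(slot >= 0 && slot < NSLOTS);
	s->free |= UINT64_C(1) << slot;
}

/*
 * Model of timeline init: claim a slot, then "map" the page.
 * map_ok is a stand-in for whether the mapping step succeeded.
 * Returns 0 on success, -1 if no slot is free, -2 if mapping failed.
 */
static int timeline_init_model(struct slab *s, int map_ok, int *slot_out)
{
	int slot = slab_get(s);

	if (slot < 0)
		return -1;

	if (!map_ok) {
		slab_put(s, slot); /* give the cacheline back: no leaked bit */
		return -2;
	}

	*slot_out = slot;
	return 0;
}
```

With the slot returned on the failure path, the free mask after a
failed init is identical to the mask before it, so the next allocation
can reuse the same cacheline.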

> >>> @@ -2616,7 +2628,7 @@ static int execlists_context_deferred_alloc(struct i915_gem_context *ctx,
> >>>                goto error_deref_obj;
> >>>        }
> >>>    
> >>> -     timeline = i915_timeline_create(ctx->i915, ctx->name);
> >>> +     timeline = i915_timeline_create(ctx->i915, ctx->name, NULL);
> >> Why does this use the global HWSP rather than a per context one?
> > .global_hwsp = NULL => it allocates its own HWSP.
> >
> > Were you thinking of intel_engine_setup_common() which is still using
> > the global HWSP at this point in time?
> Doh. Brain fart. Presumably the engine one will disappear completely? Or 
> is it still needed for legacy mode?

It (the timeline embedded inside the engine) is killed later, once
the internal clients (perf/pmu, hangcheck and idling at the last count)
are ready for the lack of a globally ordered execution queue. The single
ringbuffer + timeline persists for legacy. (Multiple timelines for gen7,
coming later!)
-Chris