[Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

Wed Jul 8 01:51:07 PDT 2015

On Tue, Jul 07, 2015 at 10:03:09PM -0700, Kenneth Graunke wrote:
> Hi Chris,
> 
> I made a genuine effort to review this patch, hoping to better understand
> the various changes and what you were trying to accomplish.  I spent many
> hours reading and trying to enumerate changes - or potential changes I
> needed to look hard at to convince myself whether they were correct.
> 
> I came up with a frighteningly long list of changes:

* Locking completely overhauled.

> * Relocation handling changes considerably (the original point of
>   Kristian's endeavour which led up to this).

It's been a long time since 2011.

> * Fencing, busy tracking, and sync objects are completely reworked.

sync obje
That's a slight overstatement.

> * Render-to-texture cache flushing and dirty buffer tracking is
>   completely reworked.

And immensely simplified. A hash table? Did you notice that the current
dirty tracking misses where the same buffer is rewritten?

> * Gen7 SOL buffer offset resetting now uses MI_LOAD_REGISTER_IMM rather
>   than the execbuf2 parameter, requiring the command validator on Haswell.
>   This effectively bumps the kernel requirement from v3.6 to v4.2-rc1,
>   which will simply not fly with distributions at this time.

My bad, completely missed that there was EXT_transform_feedback that only
did the subset. I wondered why you had code to the pipelined loads/save
but then still flushed the batch every time.

> * glBufferSubData() now uses intel_upload_data() rather than allocating
>   a temporary BO.  This is the first use of the upload buffer by the
>   BLT engine, and could imply that the upload buffer's lifetime now
>   extends across batches - longer than before.  Separable change that
>   requires separate evaluation and justification.

Yes and no, it is easy to demonstrate that the buffer's lifetime is
longer than a batch current.

> * Per buffer cache-coherency checking rather than brw->has_llc?

We've pointed out the bugs in the current usage of brw->has_llc before.

> * glBufferSubData()'s prefer_stall_to_blit flag appears to depend on
>   per-buffer cache-coherency rather than being set globally.  Could
>   impact performance of buffer uploads.
> 
> * Potential missing flushes (which can cause hangs or misrendering):
> 
>   - It looks like calling brw_bo_busy() with BUSY_FLUSH causes a flush
>     when necessary.  However, some instances of the old bo_busy,
>     bo_references, batch_flush pattern are replaced without that flag.
>     One occurrance was in BufferSubData(); I did not spend time to
>     check every case.

The change is still accurate.

>   - Flushes are often done implicitly by e.g. brw_bo_read calling
>     brw_bo_map with the appropriate flags, and many explicit checks
>     and flushes are removed.  Not bad, but needs careful review.
>
>   - Gen6+ query object code might have dropped an implicit flush
>     guaranteeing that when the GL application requests the result,
>     any pending work will be kicked off so they can poll/spin
>     repeatedly until the result arrives.

The flush is not missing.

>   - New code to avoid redundant flushes.

> * perf_debug() warnings are removed all over the code for some reason:

The problem is perf_debug() was a layering violation (my goal was to
keep brw_batch a mini-library, either to extract it back to libdrm or
for reuse). I have an idea of using a macro like

#define PERF_DEBUG_LOC(brw, str) ((brw->perf_debug ? &(struct loc){__FILE__,
__FUNCTION__, __LINE__, str)} : NULL))

brw_bo_map(..., PERF_DEBUG_LOC(brw));

Then pass back to a brw_context callback if a function stalls.

>   - Unsynchronized maps/BufferSubData not working on !LLC platforms?
>     If they work now, that's a huge change!  If not, why drop the warning?

They have always worked on !llc. The same caveats with using multiple
aliasing to the same physical address through different modes of write
caching apply to both llc and !llc.

>   - Warnings about stalls on mapping buffers and miptrees are gone now.
>     These have been useful in tracking down performance problems.  They
>     might not always be accurate, but surely removing them should be done
>     separately with justification?

See issues with calling perf_debug.

>   - Warnings about stalls on query objects are gone.  I've used these when
>     analyzing application performance.  Why?

See issues with calling perf_debug.

>   - Warnings about implicit flushes are gone.

See issues with calling perf_debug.

It was at the back of my mind that I had to fix perf_debug - if I had
put in the list of issues, I would probably would have remembered to fix
it before posting.

> * BO unmap calls appears to be missing in some places.  A few map calls
>   have moved around in hard-to-follow ways.  Unclear how lifetimes of
>   buffers and lifetimes of maps are affected.

Bo unmap is a figment of your imagination. libdrm never unmaps a bo until
it is closed (simply to avoid the cost of refaulting the objects -
though that does come at some risk of address space exhaustion on 32bit
systems with large memory).

> * Possible mmap vs. pwrite preference changes?  Hard to follow.

But easy to reason. mmap is quicker if you have the mmapping and do not
require GGTT space, otherwise pwrite is preferred as it avoids the
mmaping setup cost and can avoid the cost of acquiring GGTT space. If
you have wb/wc mmaps, use them (and more importantly, reuse them).

> * Texture upload (tiled_memcpy) changes, which is notoriously fragile
>   and can lose all of the performance benefit if the compiler isn't able
>   to optimize it just right.  Ideally separate.

The change made will not affect compilation of that unit. But will speed
up byt by over 50%. And bdw by a couple orders of magnitude along some
call paths.

> * Assertions change to GL errors in brw_get_graphics_reset_status().
> 
> * Aperture space checking significantly reworked, especially for the BLT
>   paths.  Honestly, a lot nicer, but couldn't this be separated?

Not really, after all the purpose is to completely rework how relocations
are handled, and that is the current interface through which aperture
tracking works.

brw_batch_begin = { require_space; save_state }
brw_batch_end = { check aperture && fixup }

Every step there needs changing. We can either make the change in one
location, or at every callsite. But I can do the refactoring first.

> * The bo_reuse driconf option is removed.

You set it per context on the screen bufmgr. I left in the /* help I
have no idea if I can get an xml parser here */ to where it needed to be
applied if possible. The per-context bufmgr pointer was removed, it wasn't
clear how to set it on the screen bufmgr.

We can put it back to where it was and just use brw->intelScreen->bufmgr
(which is essentially the current code anyway). That just looks
problematic to me.

> * Gen4-5 structure changes.

Not intentional. What do you think changed?

> * brw_get_timestamp() - removes initialization of result to 0.
>   Probably unnecessary and OK to delete; should be separate.

Oh, that's a rebase error. I had rewritten the drm_intel_reg_read()
function then restored it back to the original to cut out one
unnecessary change.

> * New helper functions and coding patterns.  Separable.

Which? I had move some functions around, but the only new ones were
brw_batch.h and I had to add the start/end callbacks for brw_context.
The brw_context callbacks can easily be extracted and done in a
precursor.

> * Noise (renaming, moving code between files, some other trivial changes
>   like removing 'brw' variables and moving code into "else" blocks).
> 
> * ...I probably missed some things.
> 
> Based upon this, I cannot in good conscience consider merging this patch.
> The potential for breakage is staggering.  As a proof-of-concept, you've
> done an excellent job in proving we can do much better, and introduced a
> lot of good ideas.  But there's a lot of work left to be done before we
> can consider applying it to our production quality driver.
> 
> Please advise whether you would like to work towards making a mergeable,
> incremental patch series, or if someone else should embark on that
> endeavour.

I'm happy to keep reworking the noise outside of brw-batch down to
acceptable levels, but a bit more resistant to rewritting the
brw_request tracking itself though.

Please do keep on making suggestions for dubious code and what to cut.
Thanks,
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre