[Mesa-dev] draw: Replace varray and vcache by vsplit

Fri Aug 13 08:35:10 PDT 2010

On Fri, 2010-08-13 at 08:09 -0700, Chia-I Wu wrote:
> On Fri, Aug 13, 2010 at 10:51 PM, Keith Whitwell <keithw at vmware.com> wrote:
> > On Fri, 2010-08-13 at 07:46 -0700, Chia-I Wu wrote:
> >> On Fri, Aug 13, 2010 at 10:14 PM, Keith Whitwell <keithw at vmware.com> wrote:
> >> > On Fri, 2010-08-13 at 07:04 -0700, Chia-I Wu wrote:
> >> >> Hi,
> >> >>
> >> >> There are two primitive transformations in gallium draw module.  In
> >> >> varray, primitives are "split".  When a primitive has more vertices
> >> >> than the middle end can handle, varray splits the primitive and calls
> >> >> the middle end multiple times.
> >> >>
> >> >> In vcache, primitives are "decompose"d.  More advanced primitives are
> >> >> decomposed into one of point, line(_adj), or triangle(_adj).
> >> >> Similarly, vcache may call the middle end multiple times to flush its
> >> >> internal buffer.  In some cases, vcache passes the primitives through
> >> >> without decomposing or splitting, as can be seen in vcache_check_run.
> >> >>
> >> >> The issue with vcache is that it has to decompose a primitive
> >> >> differently depending on the provoking convention, as explained in
> >> >>
> >> >>   http://lists.freedesktop.org/archives/mesa-dev/2010-August/001797.html
> >> >>
> >> >> It becomes a problem when GS is active.
> >> >>
> >> >> My proposal is to make vcache split instead of decompose.  Because
> >> >> varray only splits and vcache has a pass-through path, the rest of the
> >> >> workflow already has to support all primitive types.  Switching from
> >> >> decompose to split does not require a big change to the rest of the
> >> >> workflow.
> >> >>
> >> >> But then vcache will look a lot like varray, only with indexed
> >> >> primitive support.  It leads me to a new frontend that replaces both
> >> >> varray and vcache: vsplit
> >> >>
> >> >>  http://cgit.freedesktop.org/~olv/mesa/log/?h=draw-vsplit
> >> >>
> >> >> vsplit is based on varray.  It uses some code from vcache to support
> >> >> indexed primitives.  When vcache decomposes, flags are set to indicate
> >> >> whether the stipple counter should be reset or whether some edge of a
> >> >> triangle should be omitted in unfilled mode.  The segments of a split
> >> >> primitive have flags for similar purposes too:
> >> >>
> >> >>   DRAW_SPLIT_AFTER   More segments to come after this one
> >> >>   DRAW_SPLIT_BEFORE  There are preceding segments
> >> >>
> >> >> These flags are set by vsplit and the middle ends pass them to the
> >> >> other stages.  Therefore, the run methods of middle ends are augmented
> >> >> to take the flags.
> >> >>
> >> >> To summarize, vsplit
> >> >>
> >> >>  - fixes GS when (flatshade && flatshade_first) is on
> >> >>  - never sends more vertices than the middle end claims to handle
> >> >>  - is faster than vcache: split instead of decompose, no get_elt
> >> >>    calls
> >> >>  - no longer uses the higher bits of draw_elts for stipple/edge flags
> >> >>
> >> >> Suggestions?
> >> >
> >> >
> >> > Hi - I haven't looked at the patches yet, but a couple of questions:
> >> >
> >> > How does this interact with the draw_pipe_* code - which requires
> >> > decomposed primitives?
> >> draw_pipe.c decomposes the primitives.  That code was already there
> >> because it has to support varray and vcache_check_run, which do not
> >> decompose.
> >
> > OK.
> >
> >> > How does this cope with indexed rendering where the vertex buffers
> >> > themselves are too large (for hardware or some other entity)?  Eg.
> >> > imagine the hardware could cope with up to 64k vertices, and you have a
> >> > drawelements call randomly referencing vertices in range 0..128k ?
> >> Vertex fetching happens in the middle end so the range of the indices
> >> is not a problem.  In addition, vsplit guarantees that it never calls
> >> the middle end with more vertices than the middle end claims to support
> >> (as returned by draw_pt_middle_end::prepare).  The limit is usually
> >> decided by the size of the buffer for vertex emitting.
> >
> > I guess I'm wondering how it does this.  If the middle end says it
> > supports 64k vertices, and the vertex element looks like
> >
> >  [0, 128k, 64k, 32k, 96k, 16k, 1, ... ]
> >
> > what gets sent?  (Sorry, I still haven't looked at the code, you could
> > well have addressed this).
> I see.  The frontend would set
> 
>    fetch_elts = [0, 128k, 64k, 32k, 96k, 16k, 1, ... ]
>    draw_elts = [0, 1, 2, 3, 4, 5, 6, ...]
> 
> fetch_elts is processed by the middle end and it will fetch the given
> vertices.  draw_elts will be passed to draw_emit or the pipeline.  It
> is the new index buffer, which indexes into the fetched vertices.
> 
> It is actually the same as vcache.  So when fetch_elts is
> 
>    [0, 128k, 64k, 64k, 128k, 16k, ...],
> 
> draw_elts would be set to
> 
>    [0, 1, 2, 2, 1, 3, ...]
> 
> The number of elements to fetch (and shade) is minimized.

Thanks Chia-I, I've taken a look at the code & this makes sense - the
fetch/draw cache is still there, but specialized into 4 versions for
each element type.  And it seems like you take some steps not to hit it
unnecessarily.  
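
For illustration only, the fetch/draw remapping described in the quoted
reply can be sketched in C as below.  This is not code from the patch: the
struct, the function names and MAX_FETCH are hypothetical, and a linear
search stands in for the cache so the mapping itself stays easy to follow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capacity for this sketch; not a value from the patch. */
#define MAX_FETCH 16

struct remap {
   unsigned fetch_elts[MAX_FETCH]; /* unique vertex indices to fetch */
   unsigned draw_elts[MAX_FETCH];  /* indices into the fetched vertices */
   size_t nr_fetch;
   size_t nr_draw;
};

/* Add one incoming index.  A linear search replaces the cache here. */
static void
remap_add(struct remap *r, unsigned elt)
{
   size_t i;

   assert(r->nr_draw < MAX_FETCH);

   /* Look for a vertex that is already scheduled for fetching. */
   for (i = 0; i < r->nr_fetch; i++) {
      if (r->fetch_elts[i] == elt)
         break;
   }

   /* Not seen before: append it to the fetch list. */
   if (i == r->nr_fetch)
      r->fetch_elts[r->nr_fetch++] = elt;

   /* The new index buffer refers to the slot in the fetch list. */
   r->draw_elts[r->nr_draw++] = (unsigned) i;
}

/* Build both lists from an incoming index buffer. */
static void
remap_build(struct remap *r, const unsigned *elts, size_t count)
{
   size_t i;

   r->nr_fetch = 0;
   r->nr_draw = 0;
   for (i = 0; i < count; i++)
      remap_add(r, elts[i]);
}
```

With the incoming indices from the second example above (0, 128k, 64k, 64k,
128k, 16k), this sketch fetches four unique vertices and produces the
draw_elts sequence 0, 1, 2, 2, 1, 3.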

I'm coming up to speed on it though, so a couple more questions - for
fan primitives, it seems like you always end up in the segment_cache
code -- is that true, or is there a fastpath I missed?  In particular,
if the whole fan fits within the limits of the middle end, will it still
end up going through the cache?

Actually it looks like this happens in an early-out at the bottom of the
patch:

+   /* no splitting required */
+   if (count <= max_count_simple) {
+      SEGMENT_SIMPLE(0x0, start, count);
+   }

where max_count_simple is either

  vsplit->max_vertices
or
  vsplit->segment_size  (for indexed primitives)

These in turn are generated as:

+ middle->prepare(middle, vsplit->prim, opt, &vsplit->max_vertices);
+
+ vsplit->segment_size = MIN2(SEGMENT_SIZE, vsplit->max_vertices);

and SEGMENT_SIZE is 1024.
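
To make the limits concrete, here is a small C sketch of how I read the
selection above.  The struct and function names are hypothetical stand-ins,
not the ones in the patch; only SEGMENT_SIZE, MIN2 and the comparison
against max_count_simple are taken from the quoted hunks.

```c
#include <assert.h>

#define SEGMENT_SIZE 1024
#define MIN2(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical stand-in for the two vsplit fields discussed above. */
struct limits {
   unsigned max_vertices; /* as reported by the middle end's prepare() */
   unsigned segment_size; /* MIN2(SEGMENT_SIZE, max_vertices) */
};

static struct limits
make_limits(unsigned max_vertices)
{
   struct limits l;

   assert(max_vertices > 0);
   l.max_vertices = max_vertices;
   l.segment_size = MIN2(SEGMENT_SIZE, max_vertices);
   return l;
}

/* Returns 1 if the draw can take the "no splitting required" early-out,
 * 0 if it has to be split.
 */
static int
fits_simple(const struct limits *l, unsigned count, int indexed)
{
   unsigned max_count_simple =
      indexed ? l->segment_size : l->max_vertices;

   return count <= max_count_simple;
}
```

Under these assumptions, with a middle end that reports 65536 vertices, a
non-indexed draw of 2000 vertices takes the early-out, while an indexed draw
with a count of 2000 does not, because segment_size is capped at 1024.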

So any indexed primitive where the number of vertices (or is it the number
of indices?) exceeds 1024 will end up on the cache path?

I know this used to be true as well -- just wondering if there is a way
to improve on this...

Keith