[Mesa-dev] Batch buffer sizes, flushing questions

Paul Berry stereotype441 at gmail.com
Thu Oct 31 15:23:17 CET 2013


On 31 October 2013 02:22, Rogovin, Kevin <kevin.rogovin at intel.com> wrote:

>    Hi,
>
>  Thank you for the detailed answer, and now I have still more questions:
>
>  > No.  do_flush_locked() (which is called by intel_batch_buffer_flush())
> follows that by calling either drm_intel_bo_mrb_exec() or
> drm_intel_gem_bo_context_exec().  That's what > causes the batch to be
> queued for execution.
>
> I think I am getting quite confused by the contents of
> intel_upload_finish(), for it has this:
>
>
>    53
>    54    if (brw->upload.buffer_len) {
>    55       drm_intel_bo_subdata(brw->upload.bo,
>    56                            brw->upload.buffer_offset,
>    57                            brw->upload.buffer_len,
>    58                            brw->upload.buffer);
>    59       brw->upload.buffer_len = 0;
>    60    }
>    61
>    62    drm_intel_bo_unreference(brw->upload.bo);
>    63    brw->upload.bo = NULL;
>    64 }
>
> whereas the batch buffer is represented by the member brw_context::batch
> (I think). What is the role of brw_context::upload?
>

The one and only role of brw_context::upload is to handle vertex and index
data that is passed into OpenGL using client pointers by clients that
aren't using ARB_vertex_buffer_object to manage their vertex and index
data in GPU buffers.  At the time of the draw call, the driver has to copy
any such client-owned data into a newly-allocated buffer object for
consumption by the hardware.  brw->upload.bo is that buffer object.  The
remaining fields in brw->upload are a bookkeeping mechanism to allow small
chunks of vertex and index data to be shared in the same buffer object so
that we don't waste memory.
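
For concreteness, here's a minimal example (standard legacy GL, not driver
code) of the client-pointer usage this path exists to serve: no VBO is
bound, so the array lives in application memory and the driver has to copy
it into brw->upload.bo at draw time.  The function name is just for
illustration.

   #define GL_GLEXT_PROTOTYPES
   #include <GL/gl.h>
   #include <GL/glext.h>

   static const GLfloat verts[] = { 0.0f, 0.0f,  1.0f, 0.0f,  0.0f, 1.0f };

   static void
   draw_with_client_pointer(void)
   {
      glBindBuffer(GL_ARRAY_BUFFER, 0);        /* make sure no VBO is bound     */
      glEnableClientState(GL_VERTEX_ARRAY);
      glVertexPointer(2, GL_FLOAT, 0, verts);  /* client pointer, not BO offset */
      glDrawArrays(GL_TRIANGLES, 0, 3);        /* data copied to brw->upload.bo */
   }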


> It looks like the size is limited to 4K, so what is it used to upload?
>

Actually there are two mechanisms used by intel_upload_data() to store data
in brw->upload.bo, depending on the size of the data being stored.  If the
size is 4k or more, then it is uploaded directly to brw->upload.bo using
drm_intel_bo_subdata().  If it's less than 4k, it is memcpy'ed to
brw->upload.buffer, which acts as a temporary staging area; when
brw->upload.buffer gets full (or it's time to flush the batch), the
contents of brw->upload.buffer are uploaded to brw->upload.bo using
drm_intel_bo_subdata().  This allows us to reduce the overhead of calling
drm_intel_bo_subdata() when uploading tiny chunks of data at a time.
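
As a rough sketch (this is not the actual intel_upload_data(); the struct is
a minimal stand-in whose field names mirror the intel_upload_finish()
snippet you quoted, and next_offset is a hypothetical write cursor), the
size-based split looks something like this:

   #include <stdint.h>
   #include <string.h>
   #include <intel_bufmgr.h>   /* libdrm: drm_intel_bo, drm_intel_bo_subdata() */

   struct upload_sketch {
      drm_intel_bo *bo;            /* GPU buffer object receiving the data  */
      uint32_t      next_offset;   /* hypothetical write cursor into the BO */
      char          buffer[4096];  /* CPU staging area for small chunks     */
      uint32_t      buffer_len;    /* bytes currently staged                */
   };

   static void
   upload_data_sketch(struct upload_sketch *up, const void *data, uint32_t size)
   {
      if (size >= 4096) {
         /* Large chunk: one drm_intel_bo_subdata() straight into the BO. */
         drm_intel_bo_subdata(up->bo, up->next_offset, size, data);
         up->next_offset += size;
      } else {
         /* Small chunk: stage it in CPU memory; the whole staging buffer is
          * written to the BO with a single drm_intel_bo_subdata() later, when
          * it fills up or when intel_upload_finish() runs at batch-flush time
          * (error and overflow handling omitted). */
         memcpy(up->buffer + up->buffer_len, data, size);
         up->buffer_len += size;
      }
   }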


>
> I can see those DRM execution commands in do_flush_locked(). That
> function's implementation is making me a touch confused too, for I see two
> uploads:
>
>
>   244
>   245    if (brw->has_llc) {
>   246       drm_intel_bo_unmap(batch->bo);
>   247    } else {
>   248       ret = drm_intel_bo_subdata(batch->bo, 0, 4*batch->used, batch->map);
>   249       if (ret == 0 && batch->state_batch_offset != batch->bo->size) {
>   250          ret = drm_intel_bo_subdata(batch->bo,
>   251                                     batch->state_batch_offset,
>   252                                     batch->bo->size - batch->state_batch_offset,
>   253                                     (char *)batch->map + batch->state_batch_offset);
>   254       }
>   255    }
>
> I understand the first "uploads batch->used uint32_t's from batch->map to
> the DRM memory object", but I do not quite follow the second upload; what
> magic is going on with batch->state_batch_offset and, for that matter,
> batch->bo->size?
>

It looks like Abdiel answered this question, but to provide some more
context: we actually store more data in the batch buffer than just the
stuff that the hardware docs would describe as "batch commands".  We also
use the batch buffer to store a lot of smaller data structures that are
pointed to by batch commands, such as surface states, binding tables, cc
("color calculator") state, blend state, and so on (the docs typically
refer to this kind of stuff as "dynamic state").  The way we achieve this
is the "stack and heap" model: the batch commands are allocated starting at
the start of the batch buffer and moving towards higher addresses (like the
heap in a conventional UNIX process), and the dynamic state is allocated
starting at the end of the batch buffer and moving towards lower addresses
(like the stack in a conventional UNIX process).  batch->used is the "top
of heap pointer" (the buffer offset where the next batch command will be
written), and batch->state_batch_offset is the "stack pointer" (the buffer
offset where the most recently written piece of dynamic state is located).
When these pointers meet in the middle of the batch, we know we are out of
room in the batch buffer, so we flush it.  Confusingly, batch->used is
measured in uint32's, while state_batch_offset is measured in bytes.
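
A tiny sketch of the resulting out-of-room check (illustration only, with a
stand-in struct for the two fields; the real driver has its own helpers for
this):

   #include <stdbool.h>
   #include <stdint.h>

   struct batch_sketch {
      uint32_t used;                /* batch commands written, in uint32_t's */
      uint32_t state_batch_offset;  /* lowest byte offset of dynamic state   */
   };

   /* Commands grow upward from offset 0, dynamic state grows downward from
    * the end of the buffer; flush before the two regions would collide. */
   static bool
   batch_has_room(const struct batch_sketch *batch, uint32_t bytes_needed)
   {
      return 4 * batch->used + bytes_needed <= batch->state_batch_offset;
   }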

Now, to explain the code you've quoted above.  If the hardware has an LLC
("last level cache"), then the way we build this batch buffer is by mapping
the buffer object into CPU memory and writing to it directly.  So when
we're ready to send the batch to the hardware, all we need to do is unmap
it and we're ready to go.  However, if the hardware doesn't have an LLC,
building the batch buffer in that fashion would give poor performance (not
exactly sure why--I've never asked).  So instead we build the batch buffer
in a temporarily-malloc'ed piece of CPU memory, and when we're ready to
send it to the hardware we copy it into the actual buffer object using
drm_intel_bo_subdata().  The reason we do two drm_intel_bo_subdata() calls
is so that we won't waste time copying the unused region of memory between
batch->used and batch->state_batch_offset.


>
> Going further down, I see that if the command is a blit it uses a
> different execution DRM command. I have not been able to find a reference
> of what each different DRM command does, the best I have found so far are:
> http://lwn.net/Articles/283798/ [Keith Packard's Article/Thread on LWN]
> and https://www.kernel.org/doc/htmldocs/drm/ ; when I start to dig into
> the source code of DRM for what those functions do, I find they are set as
> function pointers and the chase eventually leads me to some ioctl-like
> calls, but I still do not know what they do or how they differ. Is there a
> reference or doc saying what these functions are expected to do?
>

They aren't very well documented.  Your best bet is to dig in the code and
ask specific questions if you get lost, like you're doing.  In the specific
case of submitting the batch, I believe the only difference between
drm_intel_bo_mrb_exec() and drm_intel_gem_bo_context_exec() is that one of
them is a newer API than the other; down in DRM code, they both wind up
calling do_exec2().
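
Roughly, the two call shapes look like this (a sketch only; the actual
flag/ring selection in do_flush_locked() is elided, and the signatures are
the libdrm ones as I remember them):

   #include <intel_bufmgr.h>   /* libdrm: the two exec entry points */

   static int
   submit_batch_sketch(drm_intel_bo *batch_bo, int used_bytes,
                       drm_intel_context *hw_ctx, unsigned int flags)
   {
      if (hw_ctx == NULL)
         /* Older entry point: no hardware context, optional cliprects. */
         return drm_intel_bo_mrb_exec(batch_bo, used_bytes, NULL, 0, 0, flags);
      else
         /* Newer entry point: same submission, but with a hardware context. */
         return drm_intel_gem_bo_context_exec(batch_bo, hw_ctx, used_bytes, flags);
   }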


>
>
>  > nr_prims is sometimes != 1 when the client is using the legacy
> glBegin()/glEnd() technique to emit primitives.  I don't recall the exact
> circumstances that cause it to happen, but
> > here's one example:
> >
> > glBegin(GL_LINE_STRIP);
> > glArrayElement(...);
> > ...
> > glEnd();
> > glBegin(GL_LINE_LOOP);
> > glArrayElement(...);
> > ...
> > glEnd();
>
> That PITA old school begin/end. If the context is core profile, does that
> then imply nr_prims is always 1?
>

I think so, yes.


>
>
>  > Not that I'm aware of.  My intuition is that since GL apps typically
> do a very large number of small-ish draw calls, this wouldn't be beneficial
> most of the time, and it would be
> > tricky to tune the heuristics to make it effective in the rare
> circumstances where it mattered without sacrificing performance elsewhere.
>
> By small-ish calls, do you mean the batch buffer is small or the vertex or
> fragment load is small? Generally speaking, developers are supposed to keep
> the number of glDrawFoo() calls under 1000 per frame; on embedded they are
> in for a world of hurt if they go over 500 usually, and very often over 300
> ends up being CPU-limited on many embedded platforms. The calls that I am
> thinking of as "heavy"-ish are instanced calls where there are a large
> number of instances of non-trivial geometry, the most typical example is a
> field of grass.
>

I don't know.  To be honest I was speculating beyond my realm of expertise
with this question.  Perhaps Ken or Eric would have more insight.


>
>
>  > drm_intel_bo_busy() will tell if a buffer object is still being used
> by the GPU.  Also, calling drm_intel_bo_map() on a buffer will cause the
> CPU to wait until the GPU is done
> > with the buffer.  (In the rare cases where we want to map a buffer
> object without waiting for the GPU we use
> drm_intel_gem_bo_map_unsynchronized()).
>
> Just to check: are GL buffer objects and texture surfaces then implemented
> as DRM BO's? [Looking at the various functions specified in
> intelInitTextureSubImageFuncs,  intelInitTextureImageFuncs and
> intelInitBufferObjectFuncs makes me guess so, but it still is just a guess].
>

Yes, exactly.


>
> Looking at intel_bufferobj_subdata(), why does the change of buffer object
> data that is not used only happen asynchronously when brw_context::has_llc is true?
> Also why is preferring to stall more likely to hit that path than the
> delayed data blit?
>

Eric is probably your best bet for answering these questions.


>
> Best Regards,
> -Kevin
>
>