[Mesa-dev] Batch buffer sizes, flushing questions

Thu Oct 31 10:22:23 CET 2013

Hi,

 Thank you for the detailed answer. I now have still more questions:

> No.  do_flush_locked() (which is called by intel_batch_buffer_flush()) follows that by calling either drm_intel_bo_mrb_exec() or drm_intel_gem_bo_context_exec().  That's what
> causes the batch to be queued for execution.

I think I am getting quite confused by the contents of intel_upload_finish(), for it has this:

   if (brw->upload.buffer_len) {
      drm_intel_bo_subdata(brw->upload.bo,
                           brw->upload.buffer_offset,
                           brw->upload.buffer_len,
                           brw->upload.buffer);
      brw->upload.buffer_len = 0;
   }

   drm_intel_bo_unreference(brw->upload.bo);
   brw->upload.bo = NULL;
}

whereas the batch buffer is represented by the member brw_context::batch (I think). What is the role of brw_context::upload? It looks like its size is limited to 4K, so what is it used to upload?
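To make my reading of that excerpt concrete, here is a small self-contained sketch of the pattern I think I am looking at: small writes pile up in a CPU-side staging array and get pushed into the buffer object with one bulk copy. Everything named fake_* is hypothetical and is not Mesa or libdrm code; fake_bo_subdata() only plays the role that drm_intel_bo_subdata() has above, and advancing buffer_offset on flush is my own assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names come from Mesa or libdrm. */
enum { STAGING_SIZE = 4096, FAKE_BO_SIZE = 65536 };

struct fake_bo {
   unsigned char data[FAKE_BO_SIZE]; /* pretend GPU-visible memory */
   unsigned copies;                  /* number of bulk copies issued */
};

struct fake_upload {
   struct fake_bo *bo;
   unsigned char buffer[STAGING_SIZE]; /* CPU-side staging area */
   size_t buffer_offset;               /* where staged bytes land in the BO */
   size_t buffer_len;                  /* bytes currently staged */
};

/* Plays the role drm_intel_bo_subdata() has in the excerpt. */
static void fake_bo_subdata(struct fake_bo *bo, size_t offset,
                            size_t size, const void *src)
{
   assert(offset + size <= FAKE_BO_SIZE);
   memcpy(bo->data + offset, src, size);
   bo->copies++;
}

/* Push any staged bytes into the BO with a single copy. */
static void fake_upload_flush(struct fake_upload *up)
{
   if (up->buffer_len) {
      fake_bo_subdata(up->bo, up->buffer_offset, up->buffer_len, up->buffer);
      up->buffer_offset += up->buffer_len; /* assumption: next data follows */
      up->buffer_len = 0;
   }
}

/* Stage a small piece of data; returns the offset it will have in the BO. */
static size_t fake_upload_data(struct fake_upload *up,
                               const void *src, size_t size)
{
   assert(size <= STAGING_SIZE);
   if (up->buffer_len + size > STAGING_SIZE)
      fake_upload_flush(up);            /* make room first */
   size_t where = up->buffer_offset + up->buffer_len;
   memcpy(up->buffer + up->buffer_len, src, size);
   up->buffer_len += size;
   return where;
}
```

If that is roughly right, two 4-byte writes followed by one flush cost a single copy into the BO instead of two, which would explain why the staging area can be so small.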

I can see those DRM execution commands in do_flush_locked(). That function's implementation is making me a touch confused too, for I see two uploads:

   if (brw->has_llc) {
      drm_intel_bo_unmap(batch->bo);
   } else {
      ret = drm_intel_bo_subdata(batch->bo, 0, 4*batch->used, batch->map);
      if (ret == 0 && batch->state_batch_offset != batch->bo->size) {
         ret = drm_intel_bo_subdata(batch->bo,
                                    batch->state_batch_offset,
                                    batch->bo->size - batch->state_batch_offset,
                                    (char *)batch->map + batch->state_batch_offset);
      }
   }

I understand the first one: it uploads batch->used uint32_t's from batch->map to the DRM memory object. I do not quite follow the second upload, though. What is going on with batch->state_batch_offset and, for that matter, batch->bo->size?
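Here is my guess at what the two copies amount to, written as a self-contained sketch with a plain array standing in for the BO. The fake_* names are hypothetical, and the idea that the second range holds data carved off the tail of the buffer (so that the offset starts out equal to the buffer size) is an assumption on my part that I have not confirmed in the source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { BATCH_BYTES = 256 }; /* tiny on purpose; hypothetical size */

struct fake_batch {
   uint32_t map[BATCH_BYTES / 4]; /* CPU shadow copy, like batch->map */
   unsigned char bo[BATCH_BYTES]; /* stand-in for the DRM buffer object */
   uint32_t used;                 /* dwords of commands written from the front */
   uint32_t state_offset;         /* byte offset where the tail data begins */
};

static void fake_batch_reset(struct fake_batch *b)
{
   memset(b, 0, sizeof(*b));
   b->state_offset = BATCH_BYTES; /* nothing stored at the tail yet */
}

/* Commands grow upward from offset 0. */
static void fake_emit_dword(struct fake_batch *b, uint32_t dw)
{
   assert(4 * (b->used + 1) <= b->state_offset);
   b->map[b->used++] = dw;
}

/* Tail data grows downward from the end; returns its byte offset. */
static uint32_t fake_alloc_state(struct fake_batch *b,
                                 const void *src, uint32_t size)
{
   assert(size % 4 == 0);
   assert(size <= b->state_offset - 4 * b->used);
   b->state_offset -= size;
   memcpy((char *)b->map + b->state_offset, src, size);
   return b->state_offset;
}

/* Copy only the two ranges in use; returns the number of bytes copied. */
static uint32_t fake_flush(struct fake_batch *b)
{
   uint32_t copied = 4 * b->used;
   memcpy(b->bo, b->map, 4 * b->used);             /* first upload */
   if (b->state_offset != BATCH_BYTES) {           /* anything at the tail? */
      uint32_t tail = BATCH_BYTES - b->state_offset;
      memcpy(b->bo + b->state_offset,
             (char *)b->map + b->state_offset, tail); /* second upload */
      copied += tail;
   }
   return copied;
}
```

Under that assumption, the untouched gap in the middle of the buffer is never copied, which would be the point of doing two uploads rather than one.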

Going further down, I see that if the command is a blit, a different DRM execution command is used. I have not been able to find a reference for what each DRM command does; the best I have found so far are http://lwn.net/Articles/283798/ [Keith Packard's article/thread on LWN] and https://www.kernel.org/doc/htmldocs/drm/. When I dig into the DRM source code to see what those functions do, I find they are set as function pointers, and the chase eventually leads me to some ioctl-like calls, but I still do not know what they do or how they differ. Is there a reference or doc saying what these functions are expected to do?

> nr_prims is sometimes != 1 when the client is using the legacy glBegin()/glEnd() technique to emit primitives.  I don't recall the exact circumstances that cause it to happen, but
> here's one example:
>
> glBegin(GL_LINE_STRIP);
> glArrayElement(...);
> ...
> glEnd();
> glBegin(GL_LINE_LOOP);
> glArrayElement(...);
> ...
> glEnd();

That PITA old-school begin/end. If the context is core profile, does that imply nr_prims is always 1?

> Not that I'm aware of.  My intuition is that since GL apps typically do a very large number of small-ish draw calls, this wouldn't be beneficial most of the time, and it would be
> tricky to tune the heuristics to make it effective in the rare circumstances where it mattered without sacrificing performance elsewhere.

By small-ish calls, do you mean that the batch buffer is small, or that the vertex or fragment load is small? Generally speaking, developers are supposed to keep the number of glDrawFoo() calls under 1000 per frame. On embedded they are usually in for a world of hurt if they go over 500, and very often going over 300 ends up being CPU limited on many embedded platforms. The "heavy"-ish calls I am thinking of are instanced calls with a large number of instances of non-trivial geometry; the most typical example is a field of grass.

> drm_intel_bo_busy() will tell if a buffer object is still being used by the GPU.  Also, calling drm_intel_bo_map() on a buffer will cause the CPU to wait until the GPU is done
> with the buffer.  (In the rare cases where we want to map a buffer object without waiting for the GPU we use drm_intel_gem_bo_map_unsynchronized()).
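
To check that I have those semantics right, here is a toy model of the three behaviours as I understand them from the description above. The toy_* names are hypothetical and nothing here calls libdrm; the "GPU" is just a counter, and the "wait" is simulated by clearing it.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_bo {
   int data[4];
   int pending_gpu_jobs; /* > 0 means the pretend GPU still uses this BO */
   int waits;            /* how often the CPU had to wait in toy_bo_map() */
};

/* Models the busy query: report GPU use without blocking. */
static bool toy_bo_busy(const struct toy_bo *bo)
{
   return bo->pending_gpu_jobs > 0;
}

/* Models a synchronized map: wait for the GPU, then hand out the pointer. */
static int *toy_bo_map(struct toy_bo *bo)
{
   if (toy_bo_busy(bo)) {
      bo->waits++;
      bo->pending_gpu_jobs = 0; /* pretend we slept until the GPU finished */
   }
   return bo->data;
}

/* Models an unsynchronized map: hand out the pointer immediately. */
static int *toy_bo_map_unsynchronized(struct toy_bo *bo)
{
   return bo->data;
}
```

In this model the unsynchronized map never touches pending_gpu_jobs, so the caller is the one responsible for not stepping on data the GPU is still reading.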

Just to check: are GL buffer objects and texture surfaces then implemented as DRM BOs? [Looking at the various functions specified in intelInitTextureSubImageFuncs, intelInitTextureImageFuncs and intelInitBufferObjectFuncs makes me guess so, but it is still just a guess.]

Looking at intel_bufferobj_subdata(), why does the change of buffer object data that is not in use only happen asynchronously when brw_context::has_llc is true?
Also, why is preferring to stall more likely to hit that path than the delayed data blit?

Best Regards,
-Kevin