[Mesa-dev] Gallium proposal: add a user pointer in pipe_resource

Tue Feb 1 09:55:30 PST 2011

On Mon, 2011-01-31 at 10:46 -0800, Marek Olšák wrote:
> Hi Keith,
> 
> From my point of view, user buffers are just pointers passed through
> the Gallium interface and are well-defined from what I can see. They
> might be owned by the application (e.g. set via glVertexPointer etc.),
> therefore using the transfer functions on user buffers is invalid per
> se. Moreover, the application may change the content of user buffers
> any time,

Up until now we've always worked as if user buffers were not mutable
either by the application or the driver.  This means that userbuffers
behave very much like normal buffers which have some initial data but no
transfer mechanism.  

One upshot of this is that the driver can safely promote userbuffers to
true buffers in a one-off operation. A second upshot is that userbuffers
which may change will need to be deleted & recreated by the
state-tracker.

This is more expressive than a situation where the driver has to always
assume that userbuffers may have changed between arbitrary draw calls --
if the buffer is being reused, the driver knows it has not changed.

>  meaning that drivers should convert the user buffers to real buffers
> in the draw_vbo function, then draw vertices, and then forget the real
> buffers, keeping the user buffers bound for the next draw operation.
> Drivers should not upload user buffers anywhere else, because the
> application may change the contents between glDraw* calls. If they are
> bound as vertex buffers, we don't need to know their size and
> sometimes we even can't (again, glVertexPointer etc.). Instead, we can
> use pipe_draw_info::min_index and max_index and upload only that
> range. This has proved to be a great optimization and it's how r300g
> and r600g work now.

In what sense is this different to having the state-tracker destroy and
recreate the userbuffer around each draw call?  Is it just the overhead
of the CALLOC call to create the pipe_resource struct?  Otherwise the
behaviour inside draw_vbo() looks identical - the same number of
userbuffers get uploaded.  In both cases the min/max_index values can be
used to minimize the uploaded region.

> In theory, doing user buffer uploads at the state tracker side using
> inline transfers might work and should remove some burden from
> drivers. 

This would be an alternate approach -- the state-tracker could itself
figure out min/max_index, and upload that data into a real hardware
buffer -- basically the same task that the driver is doing in both
examples above.

> In practice, inline transfers may have a very large overhead compared
> to how things work now. In transfer_inline_write, you usually want to
> map the buffer, do a memcpy, and unmap it. The map/unmap overhead can
> be really significant. There are applications that use glDrawElements
> to draw one triangle at a time, and they draw hundreds of triangles
> with user buffers in this way (yes, applications really do this). We
> can't afford doing any more work than is absolutely necessary. When
> you get 10000 or more draw_vbo calls per second, everything matters.
> 
> Currently, the radeon drivers have one upload buffer for vertices and
> it stays mapped until the command stream is flushed. When they get a
> user buffer, they do one memcpy and that's all. They don't touch
> winsys unless the upload buffer is full.

So the optimization we're really talking about here is saving the
map/unmap overhead on the upload buffer?

And if the state tracker could do the uploads without incurring the
map/unmap overhead, would that be sufficient for you to feel comfortable
moving this functionality up a level?

> 
> 
> And user-buffers tend not to stay user-buffers - they can be promoted
> to
> regular buffers behind the scenes by the driver.  Would that be
> reflected in this interface somehow?
> 
> I don't think it's needed. The pipe_resource fields can stay immutable
> and drivers can internally replace vertex buffers with their private
> pipe_resources. The state trackers don't need to know about it.
> 
> 
> 
> From the point of view of recording, replaying, debugging, remoting,
> etc. at the gallium boundary, it's preferable if all actions are
> interposable - ie. all actions are mediated by a function call of some
> sort into the gallium interface.  Giving a component a direct memory
> access into buffer contents would tend to defeat that and make
> record/replay of that action difficult.
> 
> Indeed, record/replay would be difficult but not impossbie. FWIW I
> think the interface shouldn't be specifically designed for
> record/replay. Instead, record/replay should be made work with
> whatever interface there is.

Well, yes, but there are some really powerful things you can do with an
interface like gallium if it is possible to fully interpose at this
level.  I'm sure you can come up with examples around remote rendering,
debugging, etc - it's important enough to want to preserve the ability
to interpose and serialize.

> Is it possible to get a description of what you're doing at a slightly
> higher level to try and understand if there's a solution without these
> drawbacks?
> 
> I am quite content with the current interface for user buffers, but it
> will need a few changes for it to be more... efficient. Below is my
> plan for further improving Gallium and its interactions with user
> buffers in general.
> 
> 
> 1) New vertex buffer manager
> 
> This is how I'd like to put the burden of uploading user buffers out
> of the drivers. I've made a new vertex buffer manager. It can be found
> here:
> http://cgit.freedesktop.org/~mareko/mesa/commit/?h=vbuf-mgr&id=94a53b672dd238e6a50bb6b252614dc2e9f30ddf
> 
> And the corresponding branch is here:
> http://cgit.freedesktop.org/~mareko/mesa/log/?h=vbuf-mgr
> 
> It's a module that drivers can use and it does 2 things:
> - uploads user buffers
> - takes care of converting unsupported vertex formats and unaligned
> vertex layouts to supported ones (there are vertex fetcher capability
> bits, see struct u_vbuf_caps)
> 
> Besides some typos in a few commits, this work is already done.
> 
> With this manager, the drivers don't have to deal with user buffers
> when they are bound as vertex buffers. They only get real hardware
> buffers. The drivers don't even have to deal with unsupported vertex
> formats. Moreover, this code has already proven to be fast when it was
> maturing in r300g and is specifically optimized for low overhead. Its
> integration to a driver is easy and straithforward. But I need the
> pipe_resource::user_ptr variable to be able to access the user buffer
> memory in util (efficiently).

What prevents this being efficient at the state-tracker level?  You
mention two things 
-- the map/unmap overhead (ie the state tracker has to constantly
map/unmap because it doesn't know which draw_vbo call will cause the
driver to flush its command buffer).
-- extra capability information about the hardware

> 2) Optimizing vertex array state changes in st/mesa
> 
> Because we don't need to know the sizes of user vertex buffers (as I
> said, we can use pipe_draw_info::min_index and max_index instead, as
> is done in the vertex buffer manager), we can remove the calculation
> of the buffer sizes from st_draw_vbo.

But we still must be calculating that somewhere -- the min_/max_index
info has to come from somewhere in the statetracker.

>  We can also remove pipe_vertex_buffer::max_index (again, the
> information provided by pipe_draw_info is sufficient) and the related
> code from st/mesa. Not only will this simplify st_draw_vbo, it will
> allow us to bind vertex buffers and a vertex elements state only when
> needed, i.e. when either the _NEW_ARRAY or _NEW_PROGRAM flag is set.
> It makes this usage case a lot faster:
> 
> for (i = 0; i < 1000; i++) glDrawElements(...);
> 
> This work is *almost* complete and can be found here:
> http://cgit.freedesktop.org/~mareko/mesa/log/?h=gallium-varrays-optim
> 
> The framerate in the Torcs game goes from 14 to 20 fps with this code
> (in one particular scenario I was optimizing for).
> 
> Since I no longer compute the sizes of user buffers, I have to put ~0
> in the 'size' parameter of user_buffer_create and I expect drivers not
> to use it. r300g, r600g, nv50, nvc0, softpipe, and llvmpipe should be
> ready for this. Not sure about the other drivers, but it's fixable.
> 
> --
> So this all is my current plan to simplify hardware drivers a bit and
> add some nice optimizations. Another option would be to move the new
> vertex buffer manager or something equivalent to the state tracker and
> remove user buffers from the Gallium interface, but that would be
> additional work with uncertain performance implications, so I decided
> not to take this path (for now).

I've still got some catching up to do on your code (sorry for being
slow), but I want to make sure you get what you need to make this work.
The only real constraint I don't want to break is the ability to
interpose and serialize at the gallium interface layer.

Keith