[Mesa-dev] Gallium proposal: add a user pointer in pipe_resource

Mon Feb 7 08:15:01 PST 2011

Marek, 

I'm fine with keeping user buffers -- it's only a vague hope they'll fade away over time, and I'm comfortable with keeping them as long as their behaviour is well understood. 

The really important thing for me is to preserve traceability. That is to say it should be possible to observe what happens over the interface and infer directly from that when something important happens. In this case, that would mean having a way to notice that the contents and/or size of a userbuffer changed. 

That could be as simple as a notification call that this has happened, for example "redefine_user_buffer()". On your current drivers that call would be a noop -- hopefully that's not going to be a noticiable performance hit? Then in some tracing or recording module, that call could be used to log the contents of the userbuffer to a file, or in some future indirect-rendering equivalent, the new contents could be transmitted across the wire, etc. 

This would mean that userbuffers continue to have a known size, etc, and would require the state-tracker to issue the redefine call as necessary. You're in a better position than me to comment on the performance impact of this. 

If you're saying this isn't workable from a performance POV, then as you suggest we'll have to find a way to push the logic for identifying when userbuffers changed down into those components (tracing, remoting, etc) which care about it. 

Keith 

----- Original Message -----
From: "Marek Olšák" <maraeo at gmail.com> 
To: "Keith Whitwell" <keithw at vmware.com> 
Cc: mesa-dev at lists.freedesktop.org 
Sent: Sunday, 6 February, 2011 12:01:01 PM 
Subject: Re: [Mesa-dev] Gallium proposal: add a user pointer in pipe_resource 

Hi Keith, 

1) Recreating user buffers is very expensive, even though it's only the CALLOC overhead. Draw-call-heavy apps simply suffer hard there. It's one of the things the gallium-varrays-optim branch tries to optimize, i.e. make the user buffer content mutable. I can't see another way out. 

2) The map/unmap overhead is partially hidden by the fact that: 
- r300g doesn't unmap buffers when asked to, it defers unmapping until the command stream is flushed. This optimization has resulted in about 70% frame rate increase in Nexuiz. The overhead there is now mainly when locking and unlocking a mutex and doing some checks. 
- r600g keeps all buffers mapped all the time, even textures. The only disadvantage is it consumes address space. This is a result of desperation we have with draw-call-heavy apps. (Do you remember that I wanted to add spinlocks? Frankly, that was another desperate move.) 

But it's not enough. We must prevent any unnecessary calls to transfer_map/unmap. If keeping the upload buffer mapped a little longer results in 4% perfomance increase, then I want it. I have measured the real increase from this in Torcs and it's simply worth it. The problem with inline transfers is it's like map/unmap, so it wouldn't improve anything. 

3) Not sure if you noticed, but constants are now set via user buffers as well. IIRC, Radeon and Nouveau people welcomed this change. The thing is every driver uses a different approach to uploading constants and all it needs is a direct pointer to gl_program_parameter_list::ParameterValues to do the best job. Previously, drivers stored constants in malloc'd memory, which was basically just a temporary copy of ParameterValues. Eliminating that copy was the main motivation for using user buffers for constants. r300g copies the constants to the command stream directly, whereas r600g uses u_upload_mgr, and I guess other drivers do something entirely different. As you can see, we can't get rid of user buffers while keeping all drivers on the fast path. But I'd be ok with a new set_constant_buffer(data?) function which takes a pointer to constants instead of a resource. With that, we could remove the overhead of user_buffer_create for constants. The original set_constant_buffer function can be reserved for ARB_uniform_buffer_object, but shouldn't ideally be used for anything else. 

I fully understand that you want a robust interface. I would totally agree with you if I didn't spend months profiling Mesa. I'd like to have the same except that I also want it to be performance-oriented. I am afraid it will be very hard to have that and the robustness at the same time. I and other driver devs really want to compete with proprietary drivers in terms of performance. 

On Tue, Feb 1, 2011 at 6:55 PM, Keith Whitwell < keithw at vmware.com > wrote: 

So the optimization we're really talking about here is saving the 
map/unmap overhead on the upload buffer? 

And if the state tracker could do the uploads without incurring the 
map/unmap overhead, would that be sufficient for you to feel comfortable 
moving this functionality up a level? 

Because one of the keys to performance is to do as little CPU work as possible, I'd like the upload buffer to stay mapped as long as possible and I'd like it to be used for drawing when mapped. This is OK for Radeons, because the GPU can read some part of the buffer while some other part is being filled by the CPU. However, it wouldn't change the situation with regard to recording and replaying at all. This is one of the reasons I'd like user buffers to stay and I'd like them mutable at least for vertices. The eventual record/replay module could use the information provided by pipe_draw_info::min_index and max_index to know what regions of user buffers may have been changed. 

But we still must be calculating that somewhere -- the min_/max_index 
info has to come from somewhere in the statetracker. 

This info can be obtained from pipe_draw_info and that's sufficient. There is no reason to set pipe_vertex_buffer::max_index nor pipe_resource::width0 for user buffers. Neither r300g nor r600g uses this additional information. Computing pipe_vertex_buffer::max_index and recreating user buffers only to set width0, which I don't need for anything, significantly reduces performance in draw-call-heavy apps. For example, if I dropped this computation, there would be 40% performance increase with r300g in the Torcs racing game in the Forza track (I've measured it). Even seemingly-performance-unrelated code can have a huge impact. (my experience tells me I should replace "even" with "all") 

No matter how I look at this whole issue, I can't see user buffers going away without losses. I may try to move this functinality up one level in a branch, minimize the losses as much as possible (possibly keeping some buffers persistently mapped at the driver level, not at the state tracker level), see how it performs, and then decide what to do next. But I can't say now how fast it will be. 

Best regards 
Marek 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20110207/da82503f/attachment.html>