[Mesa-dev] [PATCH] radeonsi: add cs tracing v2

Tue Mar 26 12:54:28 PDT 2013

On 26.03.2013 20:28, Marek Olšák wrote:
> On Tue, Mar 26, 2013 at 6:44 PM, Christian König
> <deathsimple at vodafone.de> wrote:
>> On 26.03.2013 18:02, Jerome Glisse wrote:
>>
>>> On Tue, Mar 26, 2013 at 12:40 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>>> On Tue, Mar 26, 2013 at 3:59 PM, Christian König
>>>> <deathsimple at vodafone.de> wrote:
>>>>> On 26.03.2013 15:34, Marek Olšák wrote:
>>>>>
>>>>>> Speaking of si_pm4_state, I think it's a horrible mechanism for
>>>>>> anything other than constant state objects (create/bind/delete
>>>>>> functions). For everything else (set/draw functions), you want to emit
>>>>>> directly into the command stream. It's not so different from the bad
>>>>>> state management which r600g used to have (which is now gone). If you
>>>>>> have to call malloc or calloc in a set_* or draw_* function, you're
>>>>>> doing it wrong. Are there plans to change it to something more
>>>>>> efficient (e.g. how r300g and r600g emit non-CSO states right now), or
>>>>>> will it be like this forever?
>>>>>
>>>>> Actually I hoped that r600g would sooner or later move further in the
>>>>> same direction. The fact that we currently need to malloc every buffer
>>>>> indeed sucks badly, but that is still better than mixing packet
>>>>> generation with driver logic.
>>>> I don't understand the last sentence. What mixing? The set_* and
>>>> draw_* commands are supposed to be executed immediately, therefore
>>>> it's reasonable and preferable to write to the CS directly. Having any
>>>> intermediate storage for commands is a waste of time and space.
>>> I agree here, I don't think an uncached bo for the command stream on new hw
>>> would bring a huge perf increase, it will probably just be noise.
>>>
>>>>> Also I don't think that emitting directly into the command stream is
>>>>> such a good idea; sooner or later we want that buffer to be allocated
>>>>> in GART memory. Under that condition it is better to build up the
>>>>> commands in (heavily cached) system memory and then memcpy them to the
>>>>> destination buffer.
>>>> AFAIK, GART memory is cached on non-AGP systems, but even uncached
>>>> access shouldn't be a big issue, because the access pattern is
>>>> sequential and write-only. BTW, I have talked about emitting commands
>>>> into a buffer object with Dave and he thinks it's a bad idea due to
>>>> the map and unmap overhead. Also, we have to disallow writing to
>>>> certain unsafe registers anyway.
>>>>
>>>> Marek
>>> I think Christian is thinking about new hw > cayman where we can skip
>>> register checking because of vm and hardware register checking (the hw
>>> CP checks that a register in the user IB is not one of the privileged
>>> registers, and blocks the write and throws an irq if it is). On this kind
>>> of hw you can have the cmd stream in a bo and skip the map/unmap.
>>
>> Yes indeed, and my plan is to avoid the copying by referencing the state
>> directly with indirect buffer commands. That should also make things like
>> queries and predicated rendering a bit simpler (think of PM4 subroutine
>> calls).
>>
>> The problem on SI is that for embedded data and const IBs you need to patch
>> up the buffer quite a bit after it is written (at least if I understand them
> If you need to patch up the buffer after it is written, i.e. it can't
> be immutable, you probably shouldn't use any buffer at all and just
> write the commands into the CS directly. How are you gonna update the
> buffer if it's busy?
>
> Unless you mean that some state depends on some other state and must
> be recomputed, which is another aspect of radeonsi which is pretty
> badly designed. Don't recompute states, just create multiple command
> buffers covering all possible combinations of states you can ever get
> with a single gallium CSO and switch between them at draw time. E.g.
> when some external state changes from I to J, switch from
> blend_state->variant[I] to blend_state->variant[J] (by just emitting
> variant[J]), where variant is an array of immutable command buffers.
> Same for pipe_surface (except that pipe_surface should initialize the
> variants on demand).

Yeah, I know. I only have two hands; where do you suggest I should
start?

>> correctly). But Marek is quite right that this only counts for state objects
>> and makes no sense for set_* and draw_* calls (and I'm currently thinking
>> how to avoid that and can't come up with a proper solution). Anyway it's
>> definitely not an urgent problem for radeonsi.
> It will be a problem once we actually start caring about performance
> and, most importantly, the CPU overhead of the driver.
>
>> I still think that writing into the command buffers directly (i.e. without
>> wrapper functions) is a bad idea, because that leads to mixing driver logic and
> I'm convinced the exact opposite is a bad idea, because it adds
> another layer all commands must go through. A layer which brings no
> advantage. Think about apps which issue 1k-10k draw calls per frame.
> It's obvious that every byte moved around counts and the key to high
> framerate is to do (almost) nothing in the driver. It looks like the
> idea here is to make the driver as slow as possible.
>
>> packet building in r600g. For example, just try to figure out how the
>> relocations in NOPs work by reading the source (please keep in mind that one
>> of the primary reasons AMD supports this driver is to provide good
>> example code for customers who want to implement that stuff on their own
>> systems).
> I'm shocked. Sacrificing performance in the name of making the code
> nicer for some customers? Seriously? I thought the plan was to make
> the best graphics driver ever.

Well, maybe I'm repeating myself: Performance is not a priority, it's 
only nice to have!

Sorry to say so, but if we sacrifice a bit of performance for more code
readability, then that is perfectly ok with me (don't get me wrong, I
would rather replace the closed source driver today than tomorrow;
it's unfortunately just not what I'm paid for).

On the other hand, we are talking about perfectly optimizable inline
functions and/or macros. All I'm saying is that we should structure
the code a bit more.
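
A minimal sketch of what such inline wrappers could look like, so that
packet building stays in one place while the set_* code only states intent.
Every name here (cmd_stream, cs_emit, cs_set_reg, the EXAMPLE_* values) and
the packet header layout are made up for illustration; this is not the
radeonsi or r600g API and not real PM4 encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical command stream: a fixed array of dwords plus a write index. */
#define EXAMPLE_CS_MAX_DW 256

struct cmd_stream {
    uint32_t buf[EXAMPLE_CS_MAX_DW];
    unsigned cdw;                 /* number of dwords written so far */
};

/* Made-up opcode and register offset, for illustration only. */
enum {
    EXAMPLE_OP_SET_REG      = 0x10,
    EXAMPLE_REG_BLEND_COLOR = 0x28
};

/* Append a single dword to the command stream. */
static inline void cs_emit(struct cmd_stream *cs, uint32_t value)
{
    assert(cs->cdw < EXAMPLE_CS_MAX_DW);
    cs->buf[cs->cdw++] = value;
}

/* Made-up header layout: opcode in the high 16 bits, payload size in the low 16. */
static inline uint32_t pkt_header(unsigned opcode, unsigned count)
{
    return ((uint32_t)opcode << 16) | (count & 0xffffu);
}

/* Packet building lives here and nowhere else: header, register, value. */
static inline void cs_set_reg(struct cmd_stream *cs, uint32_t reg, uint32_t value)
{
    cs_emit(cs, pkt_header(EXAMPLE_OP_SET_REG, 2));
    cs_emit(cs, reg);
    cs_emit(cs, value);
}

/* Driver logic: says what to set, not how the packet is encoded.
 * No malloc and no intermediate buffer; it writes straight into the CS. */
static void example_set_blend_color(struct cmd_stream *cs, uint32_t rgba)
{
    cs_set_reg(cs, EXAMPLE_REG_BLEND_COLOR, rgba);
}
```

Since everything is static inline, a compiler is free to flatten the call
chain into plain stores into the buffer, so the wrapper layer need not cost
anything at runtime.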

Christian.

> Marek
>