[Mesa-dev] [PATCH] radeonsi: add cs tracing v2

Tue Mar 26 12:28:55 PDT 2013

On Tue, Mar 26, 2013 at 6:44 PM, Christian König
<deathsimple at vodafone.de> wrote:
> On 26.03.2013 18:02, Jerome Glisse wrote:
>
>> On Tue, Mar 26, 2013 at 12:40 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>>
>>> On Tue, Mar 26, 2013 at 3:59 PM, Christian König
>>> <deathsimple at vodafone.de> wrote:
>>>>
>>>> On 26.03.2013 15:34, Marek Olšák wrote:
>>>>
>>>>> Speaking of si_pm4_state, I think it's a horrible mechanism for
>>>>> anything other than constant state objects (create/bind/delete
>>>>> functions). For everything else (set/draw functions), you want to emit
>>>>> directly into the command stream. It's not so different from the bad
>>>>> state management which r600g used to have (which is now gone). If you
>>>>> have to call malloc or calloc in a set_* or draw_* function, you're
>>>>> doing it wrong. Are there plans to change it to something more
>>>>> efficient (e.g. how r300g and r600g emit non-CSO states right now), or
>>>>> will it be like this forever?
>>>>
>>>>
>>>> Actually I hoped that r600g sooner or later moves into the same
>>>> direction some more. The fact that we currently need to malloc every
>>>> buffer indeed sucks badly, but that is still better than mixing packet
>>>> generation with driver logic.
>>>
>>> I don't understand the last sentence. What mixing? The set_* and
>>> draw_* commands are supposed to be executed immediately, therefore
>>> it's reasonable and preferable to write to the CS directly. Having any
>>> intermediate storage for commands is a waste of time and space.
>>
>> I agree here. I don't think an uncached bo for the command stream on
>> new hw would bring a huge perf increase; it will probably just be noise.
>>
>>>> Also I don't think that emitting directly into the command stream is
>>>> such a good idea, we sooner or later want that buffer to be a buffer
>>>> allocated in GART memory. And under this condition it is better to
>>>> build up the commands in (heavily cached) system memory and then
>>>> memcpy them to the destination buffer.
>>>
>>> AFAIK, GART memory is cached on non-AGP systems, but even uncached
>>> access shouldn't be a big issue, because the access pattern is
>>> sequential and write-only. BTW, I have talked about emitting commands
>>> into a buffer object with Dave and he thinks it's a bad idea due to
>>> the map and unmap overhead. Also, we have to disallow writing to
>>> certain unsafe registers anyway.
>>>
>>> Marek
>>
>> I think Christian is thinking about new hw > cayman where we can skip
>> register checking because of vm and hardware register checking (the hw
>> CP checks that a register in the user IB is not one of the privileged
>> registers, and blocks the write and throws an irq if so). On this kind
>> of hw you can keep the cmd stream in a bo and skip the map/unmap.
>
>
> Yes indeed, and my plan is to avoid the copying by referencing the state
> directly with indirect buffer commands. That should also make things like
> queries and predicated rendering a bit simpler (think of PM4 subroutine
> calls).
>
> The problem on SI is that for embedded data and const IBs you need to patch
> up the buffer quite a bit after it is written (at least if I understand them

If you need to patch up the buffer after it is written, i.e. it can't
be immutable, you probably shouldn't use any buffer at all and just
write the commands into the CS directly. How are you gonna update the
buffer if it's busy?

Unless you mean that some state depends on some other state and must
be recomputed, which is another aspect of radeonsi which is pretty
badly designed. Don't recompute states, just create multiple command
buffers covering all possible combinations of states you can ever get
with a single gallium CSO and switch between them at draw time. E.g.
when some external state changes from I to J, switch from
blend_state->variant[I] to blend_state->variant[J] (by just emitting
variant[J]), where variant is an array of immutable command buffers.
Same for pipe_surface (except that pipe_surface should initialize the
variants on demand).
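
A minimal sketch of the variant idea, just to illustrate. All names here
(cmd_variant, blend_state_init, emit_variant, the sizes) are made up for
the example and are not radeonsi code, and the header dword is a
placeholder rather than a real PM4 packet. The variants are built once
at create time and never modified afterwards; draw time only picks one
and copies it into the CS:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_VARIANTS   2   /* assumption: number of external-state combinations */
#define VARIANT_DWORDS 4   /* assumption: upper bound on dwords per variant */
#define CS_DWORDS      64  /* assumption: tiny command stream for the example */

/* Hypothetical immutable command buffer, filled once and never patched. */
struct cmd_variant {
    uint32_t dw[VARIANT_DWORDS];
    unsigned ndw;
};

/* Hypothetical CSO: one precomputed variant per external-state value. */
struct blend_state {
    struct cmd_variant variant[MAX_VARIANTS];
};

struct cmd_stream {
    uint32_t buf[CS_DWORDS];
    unsigned cdw; /* dwords written so far */
};

/* Create time: build every variant up front. */
static void blend_state_init(struct blend_state *bs, uint32_t base)
{
    for (unsigned i = 0; i < MAX_VARIANTS; i++) {
        bs->variant[i].dw[0] = 0xC0DE0000u; /* placeholder header, not real PM4 */
        bs->variant[i].dw[1] = base | i;    /* value that depends on state i */
        bs->variant[i].ndw = 2;
    }
}

/* Draw time: no recomputation, just emit the variant for the current state. */
static void emit_variant(struct cmd_stream *cs, const struct blend_state *bs,
                         unsigned j)
{
    assert(j < MAX_VARIANTS);
    const struct cmd_variant *v = &bs->variant[j];
    assert(cs->cdw + v->ndw <= CS_DWORDS);
    memcpy(&cs->buf[cs->cdw], v->dw, v->ndw * sizeof(uint32_t));
    cs->cdw += v->ndw;
}
```

Switching from I to J is then a single emit_variant(cs, bs, J) call, and
since nothing is ever written to a variant after creation, the question
of updating a busy buffer doesn't come up.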

> correctly). But Marek is quite right that this only counts for state objects
> and makes no sense for set_* and draw_* calls (and I'm currently thinking
> how to avoid that and can't come up with a proper solution). Anyway it's
> definitely not an urgent problem for radeonsi.

It will be a problem once we actually start caring about performance
and, most importantly, the CPU overhead of the driver.

>
> I still think that writing into the command buffers directly (e.g. without
> wrapper functions) is a bad idea, because that leads to mixing driver logic and

I'm convinced the exact opposite is a bad idea, because it adds
another layer all commands must go through. A layer which brings no
advantage. Think about apps which issue 1k-10k draw calls per frame.
It's obvious that every byte moved around counts and the key to high
framerate is to do (almost) nothing in the driver. It looks like the
idea here is to make the driver as slow as possible.
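
For comparison, a rough sketch of what "writing to the CS directly"
means for a set_* function. The names (direct_cs, cs_emit,
set_stencil_ref) and the header value are hypothetical, picked only for
this example and not taken from any driver. The point is that there is
no malloc and no intermediate copy, only stores into the command
stream:

```c
#include <assert.h>
#include <stdint.h>

#define CS_MAX_DW 1024 /* assumption: fixed-size CS for the example */

struct direct_cs {
    uint32_t buf[CS_MAX_DW];
    unsigned cdw; /* dwords written so far */
};

/* Append one dword to the command stream. */
static inline void cs_emit(struct direct_cs *cs, uint32_t value)
{
    assert(cs->cdw < CS_MAX_DW);
    cs->buf[cs->cdw++] = value;
}

/* Hypothetical set_* function: the state goes straight into the CS. */
static void set_stencil_ref(struct direct_cs *cs, uint32_t front, uint32_t back)
{
    cs_emit(cs, 0xC0DE0001u); /* placeholder header, not a real PM4 packet */
    cs_emit(cs, front);
    cs_emit(cs, back);
}
```

The cost per call is three stores and a counter update.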

> packet building in r600g. For example just try to figure out how the
> relocations in NOPs work by reading the source (please keep in mind that
> one of the primary reasons AMD is supporting this driver is to provide
> good example code for customers who want to implement that stuff on their
> own systems).

I'm shocked. Sacrificing performance in the name of making the code
nicer for some customers? Seriously? I thought the plan was to make
the best graphics driver ever.

Marek