[Mesa-dev] [PATCH v2 04/11] gallium: add endian_format field to struct pipe_resource

Wed Apr 20 09:14:15 UTC 2016

On Wed, Apr 20, 2016 at 12:04 PM, Michel Dänzer <michel at daenzer.net> wrote:
> On 20.04.2016 17:48, Oded Gabbay wrote:
>> On Wed, Apr 20, 2016 at 11:28 AM, Michel Dänzer <michel at daenzer.net> wrote:
>>> On 20.04.2016 03:13, Oded Gabbay wrote:
>>>> On Tue, Apr 19, 2016 at 5:59 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>>>> On Tue, Apr 19, 2016 at 3:11 PM, Oded Gabbay <oded.gabbay at gmail.com> wrote:
>>>>>> On Mon, Apr 18, 2016 at 6:03 PM,
>>>>>> Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>>>>>>> On Mon, Apr 18, 2016 at 10:47 AM, Oded Gabbay <oded.gabbay at gmail.com> wrote:
>>>>>>>> On Thu, Apr 14, 2016 at 6:44 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>>>>>>>>> On Thu, Apr 14, 2016 at 11:08 AM, Oded Gabbay <oded.gabbay at gmail.com> wrote:
>>>>>>>>>>> Wouldn't it make more sense to handle such issues in transfer_map?
>>>>>>>>>>> (i.e. create a staging memory area, and decode into it)? This assumes
>>>>>>>>>>> that the transfer_map() call has enough information to "do the right
>>>>>>>>>>> thing". I don't think it does today, but perhaps it could be taught?
>>>>>>>>>> It doesn't have all the info today, that's for sure. I imagine though
>>>>>>>>>> we can add parameters to it.
>>>>>>>>>>
>>>>>>>>>>> That way everything that's in a pipe_resource is in some
>>>>>>>>>>> tightly-controlled format, and we specify the LE <-> BE parameters
>>>>>>>>>>> when converting between CPU-read/written and GPU-read/written data. I
>>>>>>>>>>> believe this is a better match for what's really happening, too. What
>>>>>>>>>>> do you think?
>>>>>>>>>>>
>>>>>>>>>>>   -ilia
>>>>>>>>>>
>>>>>>>>>> Unless I'm missing something, I think, at the end of the day, it will
>>>>>>>>>> be the same issues as in my solution - per code path per format is a
>>>>>>>>>> different case. That's because you will still need to "teach"
>>>>>>>>>> transfer_map, per each transfer per format what to do. So one will
>>>>>>>>>> need to go and debug every single code path there is in mesa for
>>>>>>>>>> drawing/copying/reading/textures/etc., like what I did in the last 1.5
>>>>>>>>>> months. It's a great learning experience but it won't give anything
>>>>>>>>>> generic.
>>>>>>>>>>
>>>>>>>>>> Again, for example, in st_ReadPixels, I imagine you will need to give
>>>>>>>>>> "different orders" to transfer_map for the two different scenarios -
>>>>>>>>>> H/W blit and fallback. So what's the gain here ?
>>>>>>>>>>
>>>>>>>>>> If I'm missing something, please tell me.
>>>>>>>>>
>>>>>>>>> One of us is... let's figure out which one :)
>>>>>>>>>
>>>>>>>>> Here's my proposal:
>>>>>>>>>
>>>>>>>>> All data stored inside of resources is stored in a driver-happy
>>>>>>>>> format. The driver ensures that it's stored in proper endianness, etc.
>>>>>>>>> (Much like it does today wrt proper stride.)
>>>>>>>>>
>>>>>>>>> Blitting(/copying) between resources doesn't require any additional
>>>>>>>>> information, since you have the format(s) of the respective resources,
>>>>>>>>> and it's all inside the driver, so the driver does whatever it needs
>>>>>>>>> to do to make it all "work".
>>>>>>>>>
>>>>>>>>> *Accessing and modifying* resources (directly) from the CPU is what
>>>>>>>>> becomes tricky. The state tracker may have incorrect expectations of
>>>>>>>>> the actual backing data. There are a few different ways to resolve
>>>>>>>>> this. The one I'm proposing is that you only ever return a pointer to
>>>>>>>>> the directly underlying data if it matches the CPU's expectations
>>>>>>>>> (which will only be the case for byte-oriented array formats like
>>>>>>>>> PIPE_FORMAT_R8G8B8A8_* & co). Everything else, like e.g.
>>>>>>>>> PIPE_FORMAT_R5G6B5_UNORM and countless others, will have to go through
>>>>>>>>> a bounce buffer.
>>>>>>>>>
>>>>>>>>> At transfer map time, you convert the data from GPU-style to
>>>>>>>>> CPU-style, and copy back the relevant bits at unmap/flush time.
>>>>>>>>>
>>>>>>>>> This presents a nice clean boundary for this stuff. Instead of the
>>>>>>>>> state tracker trying to guess what the driver will do and feeding it
>>>>>>>>> endiannesses that it can't possibly guess properly, the tracking logic
>>>>>>>>> is relegated to the driver, and we extend the interfaces to allow the
>>>>>>>>> state tracker to access the data in a proper way.
>>>>>>>>>
>>>>>>>>> I believe the advantage of this scheme is that beyond adding format
>>>>>>>>> parameters to pipe_transfer_map() calls, there will not need to be any
>>>>>>>>> adjustments to the state trackers.
>>>>>>>>>
>>>>>>>>> One yet-to-be-resolved issue is what to do about glMapBuffer* - it
>>>>>>>>> maps a buffer, it's formatless (at map time), and yet the GPU will be
>>>>>>>>> required to interpret it correctly. We could decree that PIPE_BUFFER
>>>>>>>>> is just *always* an array of R8_UNORM and thus never needs any type of
>>>>>>>>> swapping. The driver needs to adjust accordingly to deal with accesses
>>>>>>>>> that don't fit that pattern (and where parameters can't be fed to the
>>>>>>>>> GPU to interpret it properly).
>>>>>>>>>
>>>>>>>>> I think something like the above will work. And I think it presents a
>>>>>>>>> cleaner barrier than your proposal, because none of the "this GPU can
>>>>>>>>> kinda-sorta understand BE, but not everywhere" details are ever
>>>>>>>>> exposed to the state tracker.
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>>   -ilia
>>>>>>>>
>>>>>>>> Ilia,
>>>>>>>>
>>>>>>>> To make the GPU do a conversion during blitting, I need to configure
>>>>>>>> registers. This is done in a couple of functions in the r600g driver
>>>>>>>> (r600_translate_texformat, r600_colorformat_endian_swap,
>>>>>>>> r600_translate_colorformat and r600_translate_colorswap).
>>>>>>>>
>>>>>>>> The problem is that transfer_map/unmap don't call directly to those
>>>>>>>> functions. They call other functions which eventually call those 4
>>>>>>>> functions. Among those "other" functions, there are several function
>>>>>>>> calls which are *not* in the r600g driver. i.e. we go back to generic
>>>>>>>> util functions. For example:
>>>>>>>>
>>>>>>>> #0  r600_translate_colorformat
>>>>>>>> #1  evergreen_init_color_surface
>>>>>>>> #2  evergreen_set_framebuffer_state
>>>>>>>> #3  util_blitter_custom_depth_stencil
>>>>>>>> #4  r600_blit_decompress_depth
>>>>>>>> #5  r600_texture_transfer_map
>>>>>>>>
>>>>>>>> Am I allowed to now pass information from transfer_map/unmap all the
>>>>>>>> way down to the 4 functions I mentioned through all these layers as
>>>>>>>> additional parameters ? I preferred to put it in pipe_resource as that
>>>>>>>> information goes all the way down to those functions, but if I can't
>>>>>>>> use that, then what's an acceptable alternative ?
>>>>>>>>
>>>>>>>> This time, I would like to get an agreement *before* I implement it.
>>>>>>>
>>>>>>> Probably a good idea. And as issues are investigated, people's
>>>>>>> opinions on the "correct" way might shift. Let's think about this...
>>>>>>>
>>>>>>> So clearly *a* correct way to handle this would be to stop all the
>>>>>>> lying. What's the lie? The lie is the PIPE_FORMAT. It talks about e.g.
>>>>>>> R5G6B5 but makes no mention of the byte layout in memory for those 16
>>>>>>> bits. Really what we have right now is a format and an *implicit*
>>>>>>> endian ordering, which is the CPU's. But what happens when the CPU and
>>>>>>> GPU don't agree?
>>>>>>>
>>>>>>> There's a path we could take which would be to add an endianness
>>>>>>> alongside each format (be it by doubling formats, or an explicit
>>>>>>> second field). This would be a very far-reaching change though, and I
>>>>>>> doubt you'll want to do it. What we're left with is having a format
>>>>>>> and an *implicit* endianness. Which means that the consumers of the
>>>>>>> format need to be able to work out the implicit endianness involved.
>>>>>>> And the endianness will be GPU endian for regular resources, and CPU
>>>>>>> endian for "staging" resources. So it's definitely tempting to stick
>>>>>>> the endian thing into a private field of the resource, like Rob is
>>>>>>> suggesting - when creating a staging texture in
>>>>>>> transfer_map/unmap/flush, set the endianness the cpu endian. Otherwise
>>>>>>> set it to gpu endian. And I think this is somewhat similar to your
>>>>>>> former approach.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>>   -ilia
>>>>>>
>>>>>> I don't think I have any other choice but to stick it as a private
>>>>>> field, because the endian parameter simple can't go through all the
>>>>>> function calls as an additional parameter. The reason is that
>>>>>> set_framebuffer_state() function types are called from
>>>>>> st_invalidate_state, where I don't have any idea about the "correct"
>>>>>> endianess, so I can't add the endian parameter to that function type.
>>>>>>
>>>>>> The only thing that is propagated through all layers is r600_texture.
>>>>>> I'll try to use that.
>>>>>>
>>>>>> Marek, Michel,
>>>>>> Do you think it is OK to add the endian mark to that private structure ?
>>>>>
>>>>> Yes, but doesn't util_format_description::is_array provide that info already?
>>>>
>>>> If you mean to say that they are interchangeable, than no.
>>>> It's true that for array formats we *never* need to do swaps, but for
>>>> non-array formats there are cases where we need to do it and cases
>>>> where we don't need to do it.
>>>>
>>>> For example, when writing to depth buffer through st_DrawPixels
>>>> (glDrawPixels) with GL_FLOAT, than the format used is
>>>> PIPE_FORMAT_Z32_FLOAT.
>>>> In the make_texture() part, we need to configure endian swap for the
>>>> staging buffer, but after you return from it and go into
>>>> st_create_texture_sampler_view(), you want the destination texture to
>>>> be configured without endian swapping.
>>>
>>> So the CB/DB block can't swap bytes when accessing Z32?
>>>
>> Not sure what you mean by this sentence. Could you please further
>> explain your question ?
>
> For packed formats / components such as Z32, the working theory is that
> state tracker / Mesa core / application code accesses them using the CPU
> byte order. So if the hardware can be programmed to always access them
> in the CPU byte order, that should theoretically work in all cases.
>
>
> --
> Earthling Michel Dänzer               |               http://www.amd.com
> Libre software enthusiast             |             Mesa and X developer

Well, it doesn't work in practice. If that were true, than we wouldn't
need all this hints. We would do swaps all the time.

And if you think about it, once you use a staging buffer in the
middle, than you don't want to do two swaps. If you do two swaps (from
CPU to staging buffer and from staging buffer to GPU), you will return
to the original layout, which is big-endian in our case, which the GPU
can't handle.

     Oded