[Mesa-dev] a newbie asking newbie questions

Tue Sep 17 09:25:25 PDT 2013

On 09/17/2013 05:13 AM, Rogovin, Kevin wrote:
> Hello,
> 
>  Thank you for the very fast answers. Some more questions:
> 
> 
>> It's not a preference question.  The registers are 8 floats wide.
>> Vertex shaders get invoked 2 vertices at a time, with a register containing these values:
>>
>> .   +------+------+------+------+------+------+------+------+
>> .   | v0.x | v0.y | v0.z | v0.w | v1.x | v1.y | v1.z | v1.w |
>> .   +------+------+------+------+------+------+------+------+
> 
> This seems best to me: run two vertices in each invocation in the hope that the
> shader compiler will merge (multiple) float, vec2 and maybe even vec3 operations into
> vec4 operations (does it?).

Not as well as it should.  There's a lot of room for improvement in our
SIMD4x2/vector backend.  We haven't spent a ton of effort optimizing it
since vertex shaders have rarely been the bottleneck in application
performance.
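
To make the SIMD4x2 layout concrete, here is a rough model of it. It is only a sketch: the function names and the writemask string are made up for illustration and are not the backend's actual code. It shows why an unmerged scalar operation leaves most of the 8 lanes idle.

```python
REG_WIDTH = 8  # floats per register, as described above


def pack_simd4x2(v0, v1):
    """Lay out two vec4 values as one 8-wide register: v0.xyzw, then v1.xyzw."""
    assert len(v0) == 4 and len(v1) == 4
    return list(v0) + list(v1)


def mul(reg_a, reg_b, writemask="xyzw"):
    """Model one MUL: multiply the enabled channels for both vertices at once."""
    out = [0.0] * REG_WIDTH
    for vertex in range(2):
        for ch, name in enumerate("xyzw"):
            if name in writemask:
                lane = vertex * 4 + ch
                out[lane] = reg_a[lane] * reg_b[lane]
    return out


def lane_utilization(writemask):
    """Fraction of the 8 lanes doing useful work for a given writemask."""
    return 2 * len(writemask) / REG_WIDTH
```

In this model a lone float multiply (writemask "x") uses 2 of the 8 lanes, while a full vec4 multiply uses all 8, which is where merging scalar operations would pay off.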

>> while these 8 pixels in screen space:
>>
>> .   +----+----+----+----+
>> .   | p0 | p1 | p2 | p3 |
>> .   +----+----+----+----+
>> .   | p4 | p5 | p6 | p7 |
>> .   +----+----+----+----+
>>
>> are loaded in fragment shader registers as:
>>
>> .   +------+------+------+------+------+------+------+------+
>> .   | p0.x | p1.x | p4.x | p5.x | p2.x | p3.x | p6.x | p7.x |
>> .   +------+------+------+------+------+------+------+------+
>>
>> Note how one register just holds a single channel ('.x' here) of a vector.  A vec4 would take up 4 registers, and to do value0.xyzw * value1.xyzw, you'd emit 4 MULs.
> 
> This is exactly what I was trying to ask/say about the fragment shader running, i.e. n fragments are processed with one n-wide SIMD command (for i965, n=8).
> Sigh, my e-mail communications leave something to be desired.
> Some questions:
>  1) do the fragments need to be in a 4x2 block, or can it be two separate 2x2 blocks?

The GPU processes two separate 2x2 blocks of pixels, which may actually
not be anywhere near each other.
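
As a rough sketch of that layout (the helper names are hypothetical, and this models only the slot ordering from the diagram above, not real driver code): slots 0-3 hold one 2x2 block and slots 4-7 hold the other, and a vec4 multiply costs one MUL per channel.

```python
def slot_coords(block0_origin, block1_origin):
    """Return the (x, y) pixel position held in each of the 8 slots.

    Slots 0-3 cover the first 2x2 block and slots 4-7 the second. The two
    origins are independent, so the blocks need not be adjacent.
    """
    coords = []
    for ox, oy in (block0_origin, block1_origin):
        # Within a block: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, 1):
            for dx in (0, 1):
                coords.append((ox + dx, oy + dy))
    return coords


def vec4_mul_simd8(a, b):
    """Multiply two vec4 values stored as one 8-wide register per channel.

    a and b map a channel name to a list of 8 floats. Returns the result
    and the number of MULs the model issued (one per channel).
    """
    result = {}
    mul_count = 0
    for ch in "xyzw":
        result[ch] = [p * q for p, q in zip(a[ch], b[ch])]
        mul_count += 1
    return result, mul_count
```

With block origins (0, 0) and (2, 0) on the 4x2 grid from the diagram, the slots come out in the order p0, p1, p4, p5, p2, p3, p6, p7.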

>  2) for tiny triangles with fragment shaders that do not require dFdx, dFdy or fwidth, can the fragments be totally scattered?

Nope, the pixel shader always works on 2x2 blocks.
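
For context, 2x2 blocks are convenient because screen-space derivatives can be approximated by differencing neighbouring pixels in the block. The sketch below shows that general idea only. It is an assumption about the common technique, not a description of what the i965 hardware does internally, and the function name is made up.

```python
def quad_derivatives(quad):
    """Approximate dFdx and dFdy for one 2x2 block by finite differences.

    quad holds the value at [top_left, top_right, bottom_left, bottom_right].
    This is the coarse variant: one difference per axis for the whole block.
    """
    top_left, top_right, bottom_left, _bottom_right = quad
    dfdx = top_right - top_left
    dfdy = bottom_left - top_left
    return dfdx, dfdy
```

For a value that varies linearly across the screen, such as f(x, y) = 3x + 5y, this recovers the slopes 3 and 5 regardless of where the block sits.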

> Along further lines, for non-dependent texture lookups, are there code paths where the derivatives are computed
> analytically so that selecting the correct LOD does not require processing fragments in 2x2 (or larger) blocks? Or does
> the i965 hardware sampler interface not allow this kind of madness?
> 
>>> On a related note, where are the beans about the dispatch table?
>> I don't know this one (or particularly what you're asking, I guess).
> 
> Viewing docs/index.html, on the side panel "Developer Topics --> GL
> Dispatch" there is text (broken into sections "1. Complexity of GL
> Dispatch", "2. Overview of Mesa's Implementation" and "3. Optimizations")
> describing how different GL contexts for the same hardware can do
> different things for the same GL function and that Mesa has stubs which
> in turn call the "real" function. The documents go on to talk about
> various ways the function tables are filled and accessed across separate
> threads. My questions are:
>  0) is that text still accurate? In particular, the directory src/glapi is gone from Mesa (at least from what I git cloned) and I thought that was where it lived.
>  1) where/how does the i965 driver fill that table, if it exists?
>  
> Along similar lines, I see that some of the code in src/mesa/main performs various checks of various API calls and at times has some conditions dependent on what context type it is, which kind of contradicts the idea of different contexts having different dispatch tables [sort of, since the functions might just be the driver magick, whereas the stub is validate and then call driver magick]. 
> 
> -Kevin
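
The stub-and-table arrangement described in the quoted text can be sketched roughly like this. Every name here is hypothetical and this is not Mesa's actual dispatch code. It only illustrates a public entry point that looks up the current thread's context and calls whatever function that context's table holds.

```python
import threading

_tls = threading.local()  # holds the current context separately for each thread


class Context:
    """A context with its own table mapping entry-point names to functions."""

    def __init__(self, name, table):
        self.name = name
        self.table = table


def make_current(ctx):
    """Bind a context to the calling thread."""
    _tls.ctx = ctx


def gl_clear(mask):
    """Public entry point (the stub): forward to the current context's function."""
    return _tls.ctx.table["Clear"](mask)
```

Two contexts can then give the same entry point different behaviour simply by installing different functions under "Clear".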