[Mesa-dev] Resource streamer redux: Enable gather push constants

Abdiel Janulgue abdiel.janulgue at linux.intel.com
Sun Jan 4 06:04:14 PST 2015


I previously sent patches enabling hardware-generated binding tables. By
themselves, hw-binding tables give no performance improvement; they are just a
means to an end. The real meat of the RS hardware is its optimized ability to
map constants to the GRF.

Gather push constants is basically an optimized way of programming push 
constants. It gives us the ability to gather and pack constant data that
may reside in a non-contiguous block of any arbitrary buffer object without
incurring additional overhead. The goal of this series is to allow registers
representing combined UBO blocks and uniforms to be sequentially allocated and
packed tightly without holes, thus (1) reducing register pressure and
(2) minimizing the use of pull constant loads.

To achieve the same results without the resource streamer, the driver may have
to manually rearrange, reformat, and repack the entries within the already
uploaded UBO block and any uniform buffer that may be present so that the 
entries would carefully match the layout of the allocated GRFs. All of this
would happen every frame. It gets even worse if a shader fetches its constants
from two or more different constant buffer blocks.

The resource streamer achieves this hardware packing of GRF entries by parsing
a gather table whose entries contain a hardware-binding table index, an offset,
and a channel mask, and uses them to gather the sparsely-located constant data.
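
To make the idea concrete, here is a minimal software model of what a gather
table walk could look like. This is only a sketch and not the driver code or
the hardware encoding: the struct layouts, field widths, and the names
gather_entry, const_buffer, and gather_constants are all hypothetical, and it
assumes that masked-out channels are skipped and the remaining channels are
packed back to back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one gather table entry. The real hardware
 * encoding is not described in this message. */
struct gather_entry {
    uint8_t  bt_index;     /* index into the (hw-generated) binding table */
    uint16_t vec4_offset;  /* offset into the buffer, in vec4 units */
    uint8_t  channel_mask; /* bit i set = gather channel i (x, y, z, w) */
};

/* Stand-in for a constant buffer referenced by a binding table slot. */
struct const_buffer {
    const float *data;
    size_t       num_floats;
};

/* Walk the gather table and pack the selected channels contiguously
 * into dst, the way the packed GRF payload is assumed to be laid out.
 * Returns the number of floats written. */
size_t gather_constants(const struct const_buffer *binding_table,
                        size_t num_bindings,
                        const struct gather_entry *table,
                        size_t num_entries,
                        float *dst, size_t dst_capacity)
{
    size_t out = 0;

    for (size_t i = 0; i < num_entries; i++) {
        const struct gather_entry *e = &table[i];
        assert(e->bt_index < num_bindings);
        const struct const_buffer *buf = &binding_table[e->bt_index];

        for (unsigned ch = 0; ch < 4; ch++) {
            if (!(e->channel_mask & (1u << ch)))
                continue; /* channel not requested: leaves no hole in dst */

            size_t src = (size_t)e->vec4_offset * 4 + ch;
            assert(src < buf->num_floats);
            assert(out < dst_capacity);
            dst[out++] = buf->data[src];
        }
    }
    return out;
}
```

With two buffers bound, an entry selecting .xy of the second vec4 of one
buffer followed by an entry selecting .w of the first vec4 of another would
produce three consecutive floats in dst, regardless of where the sources live.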

I promised some folks that I would send this out in a coherent state before
the holidays. Unfortunately, I didn't make it in time, but I hope the current
state is enough to demonstrate my approach and make reviews possible.

I still lack real-world benchmarks. But consider this simple piglit testcase:
tests/spec/glsl-1.40/uniform_buffer/fs-struct-copy.shader_test.

With the existing method of fetching the ubo entries:

SIMD16 shader: 15 instructions. 0 loops. Compacted 240 to 176 bytes (27%)
mov(1)          g16<1>UD        0x0000000cUD
mov(1)          g18<1>UD        0x00000000UD  
mov(1)          g20<1>UD        0x00000004UD 
send(4)         g2<1>F          g16<0,1,0>F
                            sampler (1, 0, 7, 0) mlen 1 rlen 1 
send(4)         g4<1>F          g18<0,1,0>F
                            sampler (1, 0, 7, 0) mlen 1 rlen 1          
send(4)         g6<1>F          g20<0,1,0>F
                            sampler (1, 0, 7, 0) mlen 1 rlen 1          
add(16)         g8<1>F          g4<0,1,0>F      g6<0,1,0>F   
add(16)         g10<1>F         g4.1<0,1,0>F    g6.1<0,1,0>F   
add(16)         g12<1>F         g4.2<0,1,0>F    g6.2<0,1,0>F   
add(16)         g14<1>F         g4.3<0,1,0>F    g6.3<0,1,0>F  
add(16)         g120<1>F        g8<8,8,1>F      g2<0,1,0>F   
add(16)         g122<1>F        g10<8,8,1>F     g2.1<0,1,0>F   
add(16)         g124<1>F        g12<8,8,1>F     g2.2<0,1,0>F   
add(16)         g126<1>F        g14<8,8,1>F     g2.3<0,1,0>F 
sendc(16)       null            g120<8,8,1>F

Compare with gather constants enabled:

SIMD16 shader: 9 instructions. 0 loops. Compacted 144 to 112 bytes (22%)
add(16)         g4<1>F          g2.4<0,1,0>F    g3<0,1,0>F    
add(16)         g6<1>F          g2.5<0,1,0>F    g3.1<0,1,0>F  
add(16)         g8<1>F          g2.6<0,1,0>F    g3.2<0,1,0>F 
add(16)         g10<1>F         g2.7<0,1,0>F    g3.3<0,1,0>F  
add(16)         g120<1>F        g4<8,8,1>F      g2<0,1,0>F    
add(16)         g122<1>F        g6<8,8,1>F      g2.1<0,1,0>F 
add(16)         g124<1>F        g8<8,8,1>F      g2.2<0,1,0>F  
add(16)         g126<1>F        g10<8,8,1>F     g2.3<0,1,0>F
nop                                                             ;
sendc(16)       null            g120<8,8,1>F


Current Status
--------------

What works: 
- FS, VS uniforms piglit tests pass
- Fragment shader UBOs without mixed uniforms pass
- Fragment shader UBOs mixed with uniform entries sized vec4 or less pass

What doesn't work yet:
- Fragment shader UBOs with bools
- VS and GS UBOs

Vec4 backend support is not yet done. Once I complete it, I hope to publish
comprehensive benchmark scores.

Patch Summary
-------------

Series lives here: http://cgit.freedesktop.org/~abj/mesa/log/?h=rs_gather_constants0 
 
Patches 1  - 11: Enables hardware-generated binding tables which is a requirement
                 for gather push constants.
Patches 12 - 18: Enables gather push constant support for ordinary uniforms 
Patches 19 - 24: Implements fine-grained uniform uploads.
Patches 26 - 40: Adds FS-backend compiler support to upload UBOs as push constants

I'm not particularly happy about having to do patch 19. My goal was to make
the driver able to tell which stage actually modified its uniforms. With that
information, uniform uploads happen only when there is a change, which makes
the gather table generation more efficient for ordinary uniforms. If there
is any way to let the driver accept additional state flags without making the type
size of the state flag variable bigger, I would be more than happy to implement it.

I think the more interesting pieces of this series are in patches 17, 27, 30, and 34,
which change how constants are programmed into the GRF using a gather table, in
addition to specifying which channels in a register get packed and loaded. The
rest are just support for the hardware-enabling bits.

Additional Notes
----------------

This series also needs kernel support to switch on the resource streamer when
the ringbuffer jumps to the userspace batchbuffer. I have the preliminary
support here: 

http://cgit.freedesktop.org/~abj/linux/log/?h=intel_resource_streamer

Unfortunately, I also ran out of time to rebase the kernel changes to the latest
drm-nightly. I also made additional changes that toggle the hw-binding table
feature within the ring, which is actually required when multiple GL clients are
running.

To switch on hw-generated binding tables, set the environment variable
INTEL_RESOURCE_STREAMER=1. To enable gather push constants, set INTEL_GATHER=1 in
addition to the previous resource streamer variable.
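
The dependency between the two variables can be sketched as follows. This is
an illustration only and not the code from the series: the helper names
env_flag_set, use_hw_binding_tables, and use_gather_constants are
hypothetical, and treating only the exact value "1" as enabled is an
assumption.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true only if the variable is set to exactly "1". */
static bool env_flag_set(const char *name)
{
    const char *value = getenv(name);
    return value != NULL && strcmp(value, "1") == 0;
}

/* hw-generated binding tables are controlled by INTEL_RESOURCE_STREAMER. */
bool use_hw_binding_tables(void)
{
    return env_flag_set("INTEL_RESOURCE_STREAMER");
}

/* Gather push constants need hw-binding tables, so INTEL_GATHER only
 * takes effect when the resource streamer variable is also set. */
bool use_gather_constants(void)
{
    return use_hw_binding_tables() && env_flag_set("INTEL_GATHER");
}
```

Setting INTEL_GATHER=1 on its own therefore changes nothing in this sketch;
both variables have to be present in the environment of the GL client.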

 src/mesa/drivers/dri/i965/brw_binding_tables.c   | 199 ++++++++++++++++++++++++++++++++++++++++++-
 src/mesa/drivers/dri/i965/brw_context.c          |  19 ++++-
 src/mesa/drivers/dri/i965/brw_context.h          |  41 ++++++++-
 src/mesa/drivers/dri/i965/brw_defines.h          |  16 ++++
 src/mesa/drivers/dri/i965/brw_draw.c             |  14 +++
 src/mesa/drivers/dri/i965/brw_fs.cpp             |  46 +++++++---
 src/mesa/drivers/dri/i965/brw_fs.h               |   3 +
 src/mesa/drivers/dri/i965/brw_fs_visitor.cpp     |  42 ++++++++-
 src/mesa/drivers/dri/i965/brw_program.c          |  12 +++
 src/mesa/drivers/dri/i965/brw_state.h            |  25 ++++++
 src/mesa/drivers/dri/i965/brw_state_upload.c     |   9 +-
 src/mesa/drivers/dri/i965/brw_vec4_visitor.cpp   |   3 +
 src/mesa/drivers/dri/i965/brw_wm.c               |   3 +
 src/mesa/drivers/dri/i965/brw_wm_surface_state.c |   6 ++
 src/mesa/drivers/dri/i965/gen6_blorp.cpp         |  35 ++++++--
 src/mesa/drivers/dri/i965/gen6_gs_state.c        |   2 +-
 src/mesa/drivers/dri/i965/gen6_vs_state.c        |  51 +++++++----
 src/mesa/drivers/dri/i965/gen6_wm_state.c        |   2 +-
 src/mesa/drivers/dri/i965/gen7_blorp.cpp         |   7 +-
 src/mesa/drivers/dri/i965/gen7_disable.c         |   4 +
 src/mesa/drivers/dri/i965/gen7_vs_state.c        | 136 ++++++++++++++++++++++++++++-
 src/mesa/drivers/dri/i965/gen7_wm_state.c        |   2 +-
 src/mesa/drivers/dri/i965/intel_batchbuffer.c    |  11 ++-
 src/mesa/drivers/dri/i965/intel_reg.h            |   3 +
 src/mesa/main/dd.h                               |   4 +-
 src/mesa/main/mtypes.h                           |   6 +-
 src/mesa/main/state.c                            |  16 ++--
 src/mesa/main/uniform_query.cpp                  |   6 ++
 28 files changed, 660 insertions(+), 63 deletions(-)