[Mesa-dev] [RFC 00/20] shader compiler cache

Wed Jun 11 17:41:48 PDT 2014

I've been looking at this is a bit since last week, and I have some
concerns about the overall way things are structured.  There are also
almost no checks that the values being de-serialized are valid.  What
happens if an ir_expression::operation is 0xDEADBEEF?  I think I know
where we want to be for performance and maintainability, but I'm not
100% how to get there from here.

What follows is some stream-of-conciousness rambling on the topic...

We really want to have 3 operations associated with each piece of IR:

1. Convert the IR to a blob of bits.

2. Validate that a blob of bits is valid.  This should at least validate
that all the individual values are sane (e.g., the operation of a
ir_expression), and it should probably also detect more fine detail
errors (e.g., if the operation of an ir_expression is a binary
operation, there must be exactly two operands).

3. Convert the blob of bits back to an IR object.

Each of these operations should be implemented separately (e.g., in
different functions), and, ideally, each set of 3 things that operate on
one kind of IR object would be in one patch... so that it's easier to
review.

Having these separate but grouped in a logical way should make it easier
to maintain when the IR changes.  If someone changes an IR object or
adds a new one, it should be pretty obvious which functions in the
serializer they need to modify or add.

My current thinking is that each kind of thing that is serialized (user
defined type, strings, and instructions) should be grouped together.
During reading and writing, these would get stored in a table.  The blob
of bits will then just contain table offsets.

This means that we'd want to emit instructions from the tree IR
bottom-up.  So for a tree like

       +
     /   \
    *     -
   / \   / \
  a   b c   d

we'd emit the derefs of a and b first (to be stored in the table at
locations N and N+1), then the multiply would reference table locations
N and N+1 (and be stored at location N+2).  Then the derefs of c and d
first (to be stored in the table at locations N+3 and N+4), then the
subtract would reference table locations N+3 and N+4 (and be stored at
location N+5).  Finally, the add would reference locations N+2 and N+5
(and be stored at location N+6).

All of this bottom-up processing of the IR tree should be done using an
ir_hierarchical_visitor.  Duplicating the traversal logic with the
serialization logic is a maintainence problem (in addition to
duplicating code).

Things that have a list of instructions (ir_function, ir_if, etc.) would
just have a count and a list of indices.  Maybe have the list of indices
also end with a special sentinel so that various kinds of corruption
(length being too large or small, list corrupted, etc.) can be detected.

During reading, the table would contain a pointer to an IR object and a
flag.  The flag is initially false, and it is set to true when the IR
object is used.  If something tries to use a table entry with the flag
already true, generate an error.  Some special handling of ir_variable
is needed.  An ir_dereference_variable can only use an ir_varible that
has the flag true.

If the different things (strings, types, and instructions) are stored in
that order, the reader could make a single pass over the buffer, and
conversion should be very efficient.  Instead of having an explicit
string table, maybe just store offsets into the string buffer.  You'd
want checks that offset-1 is a nul (so store a nul at the beginning of
the buffer) and offset is inside the table.  We could optimize the table
so that variable names "foo" only appear once in the table, but that can
be a future optimization if we ever care.

Each of the individual instruction blobs should have a structure
associated with them.  Then we can just

    const blob_expression *eb = (blob_expression *) ptr;

    if (!blob_validate(eb, some other data))
        return fail;

    ptr += sizeof(blob_expression);

    instruction_table[i++] =
        blob_create_expression(eb, some other data);

    ...

Maybe blob_create_* should return the next pointer location too.  Dunno.

On 06/02/2014 05:05 AM, Tapani Pälli wrote:
> Hello;
> 
> This series provides a disk cache for the shader compiler and is used 
> to 'skip glLinkProgram' like GL_ARB_get_program_binary does but under 
> the hood without api for the client.
> 
> Many of the patches are from my 'GL_ARB_get_program_binary' series and 
> some were reviewed by Paul earlier. I've tried to split the whole thing 
> in smaller pieces when possible and added more validation in IR level 
> and also higher level. Most importantly sizeof(long) and some important 
> structure sizes must match, otherwise cache gets invalidated.
> 
> There are some new 'uncomfortable' things that Mesa must do when having 
> a disk cache. This includes directory creation, cache size management 
> and just general file read/write. Some of these might not have been 
> written in a portable way (especially patch 20), feedback appreciated.
> 
> This set passes all of other Piglit quick tests but not the transform 
> feedback tests. I believe most (if not all) of them can be fixed by 
> having key generation based on some of the variables used for xfb in 
> gl_shader_program and also serializing xfb related structures but I 
> have not done this yet as I think the key generation overall would need 
> some better algorithm based on the gl_shader_program contents (only 
> those contents which are always same for the same shader program), 
> would be good to get some feedback for this. Now these contents seem a 
> bit scattered, hard to generate a key.
> 
> I've limiting the supported set of shaders by having additional checks 
> during shader serialization. This is simply because of some things may 
> not be implemented yet or not having good test cases, these will be 
> improved.
> 
> With the series I get 50% improvement for L4D2 startup (goes down from 
> 30s to ~15s) This is the time after startup video to menu screen. With 
> some other apps the improvement varies a lot (as example I measured 
> ~10% improvement with GLBenchmark 2.7). It would be nice to get some 
> more test results, especially with modern games.
> 
> The GL_ARB_get_program_binary extension can be implemented easily on 
> top of this. Also shader backend assembly and other additional data can 
> be added later as part of the same cache.
> 
> If no objections I would be tempted to start pushing in patches that 
> were already reviewed as this is a lot of code to carry around.
> 
> Any comments welcome;
> 
> 
> Tapani Pälli (20):
>   glsl: memory_writer helper class for data serialization
>   glsl: glsl_type serialization
>   glsl: serialize methods for IR instructions
>   glsl: memory_map helper class for data deserialization
>   glsl: add MESA_SHADER_CACHE_MAGIC string for shader binary cache
>   glsl: export populate_symbol_table function
>   glsl: add MAX_NUM_STATE_SLOTS and check against builtin uniforms
>   glsl: glsl_type deserialization
>   glsl: ir_deserializer class for the binary shader cache
>   mesa: iterate method for string_to_uint_map
>   glsl: functions to serialize gl_shader and gl_shader_program
>   mesa: add _Linked helper to gl_shader_program struct
>   glsl: functions to deserialize gl_shader and gl_shader_program
>   mesa: add _mesa_mkdir helper to imports.h
>   glsl: add capability to read/map files to memory_map
>   glsl: add method to deserialize binary program from given path
>   mesa: add helpers for the binary shader cache
>   mesa/program: add disk cache functionality
>   mesa: take shader program cache in to use
>   mesa: binary cache size management
> 
>  src/glsl/Makefile.sources         |   4 +
>  src/glsl/builtin_variables.cpp    |   3 +
>  src/glsl/glsl_types.cpp           | 164 +++++++++
>  src/glsl/glsl_types.h             |  25 ++
>  src/glsl/ir.h                     |  45 +++
>  src/glsl/ir_deserializer.cpp      | 730 ++++++++++++++++++++++++++++++++++++++
>  src/glsl/ir_deserializer.h        | 121 +++++++
>  src/glsl/ir_serialize.cpp         | 340 ++++++++++++++++++
>  src/glsl/ir_serialize.h           |  36 ++
>  src/glsl/linker.cpp               |   2 +-
>  src/glsl/linker.h                 |   3 +
>  src/glsl/memory_map.h             | 230 ++++++++++++
>  src/glsl/memory_writer.h          | 204 +++++++++++
>  src/glsl/shader_cache.h           | 101 ++++++
>  src/glsl/shader_cache_magic.h     |  36 ++
>  src/glsl/shader_deserialize.cpp   | 421 ++++++++++++++++++++++
>  src/glsl/shader_serialize.cpp     | 240 +++++++++++++
>  src/mesa/Makefile.sources         |   1 +
>  src/mesa/main/config.h            |   4 +
>  src/mesa/main/context.c           |   4 +
>  src/mesa/main/imports.h           |  18 +
>  src/mesa/main/mtypes.h            |   7 +
>  src/mesa/program/hash_table.h     |   8 +
>  src/mesa/program/ir_to_mesa.cpp   |   9 +
>  src/mesa/program/prog_diskcache.c | 320 +++++++++++++++++
>  src/mesa/program/prog_diskcache.h |  40 +++
>  26 files changed, 3115 insertions(+), 1 deletion(-)
>  create mode 100644 src/glsl/ir_deserializer.cpp
>  create mode 100644 src/glsl/ir_deserializer.h
>  create mode 100644 src/glsl/ir_serialize.cpp
>  create mode 100644 src/glsl/ir_serialize.h
>  create mode 100644 src/glsl/memory_map.h
>  create mode 100644 src/glsl/memory_writer.h
>  create mode 100644 src/glsl/shader_cache.h
>  create mode 100644 src/glsl/shader_cache_magic.h
>  create mode 100644 src/glsl/shader_deserialize.cpp
>  create mode 100644 src/glsl/shader_serialize.cpp
>  create mode 100644 src/mesa/program/prog_diskcache.c
>  create mode 100644 src/mesa/program/prog_diskcache.h