[Mesa-dev] [PATCH] i965: Enable disk shader cache by default

Matt Turner mattst88 at gmail.com
Thu Dec 7 17:57:48 UTC 2017


On Tue, Dec 5, 2017 at 3:40 PM, Jordan Justen <jordan.l.justen at intel.com> wrote:
> On 2017-12-05 14:49:28, Mark Janes wrote:
>> Jordan Justen <jordan.l.justen at intel.com> writes:
>>
>> > On 2017-12-05 09:13:11, Mark Janes wrote:
>> >> Jordan Justen <jordan.l.justen at intel.com> writes:
>> >>
>> >> > On 2017-11-08 17:26:47, Timothy Arceri wrote:
>> >> >> Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com>
>> >> >>
>> >> >> Mark may want to consider adding some of the once a day type CI runs for
>> >> >> this. For example, running the test suite for two consecutive runs on the
>> >> >> same build so that the second run uses the shader cache, and also a
>> >> >> second run that uses MESA_GLSL=cache_fb to force testing of the cache
>> >> >> fallback path.
>> >> >
>> >> > Yeah. We discussed this previously, but I don't think it's been
>> >> > implemented yet. My opinion is that it could perhaps be a weekly test.
>> >>
>> >> This automation is implemented now. It found the issues reported in
>> >> https://bugs.freedesktop.org/show_bug.cgi?id=103988
>> >>
>> >> My opinion is that this test strategy is inadequate.
>> >
>> > Meaning you think we cannot enable i965 shader cache for the 18.0
>> > release?
>>
>> I haven't heard anyone express confidence that this feature is bug-free,
>> and I don't know on what basis that claim could be made.  I appreciate
>> that a lot of manual testing has been done.  Do you feel confident
>> that the cache can be enabled for all mesa customers without disruption?
>
> If we had enabled it two months ago, then yes, we would have worked
> through the niche issues, or discovered the so-far-hidden major land
> mine. At this point, it's getting risky. In a month, I will say no.
>
>> > We are already ~1.5 months away from the next stable branch point. If
>> > we want to enable this in 18.0, we should be using this time to see if
>> > enabling the cache by default has some large unexpected side effect in
>> > our graphics stack, or breaks real-world apps.
>>
>> I agree.  We should encourage as many users as possible to enable the
>> shader cache in their environments.  At least one stable release should
>> be out with a working cache, where the feature can be enabled by those
>> who are eager to benefit from it.  I assume there will be a lot of them,
>> and they could flush out issues for everyone who has no idea what a
>> shader cache is.
>>
>> >> Adding a dimension to the test matrix has high cost, especially when
>> >> combined with other dimensions of the test matrix (does shader cache
>> >> need to be tested for 32/64 bit builds? For hswgt1/hswgt2/hswgt3e?).
>> >
>> > Are you saying this is too high cost to run per check-in? Do you need
>> > to disable it for the health of CI? I think I proposed that daily, or
>> > perhaps even weekly would be adequate.
>>
>> Certainly, the test time per line of shader cache code is massively
>> higher than any other feature, even if you run it just once a month.
>> Other features have tests that run in milliseconds, not 30min * 20
>> machines.
>
> The scope of this feature is nearly the entire API. It is justified to
> throw the entire GL suite of tests at it on a regular basis. The cost
> of running this once per week ought to be reasonable.

But the entire API boils down to a comparatively small set of
non-orthogonal state. The configuration of those knobs seems to me like
the place things are most likely to break.

I think there's value in testing that we're hitting the cache, but if
we're not, that's not a functional regression. I'm more concerned about
ensuring we don't have bugs that affect functionality and cause things
to break.

The program key for fragment shaders looks like:

/** The program key for Fragment/Pixel Shaders. */
struct brw_wm_prog_key {
   /* Some collection of BRW_WM_IZ_* */
   uint8_t iz_lookup;
   bool stats_wm:1;
   bool flat_shade:1;
   unsigned nr_color_regions:5;
   bool replicate_alpha:1;
   bool clamp_fragment_color:1;
   bool persample_interp:1;
   bool multisample_fbo:1;
   bool frag_coord_adds_sample_pos:1;
   enum brw_wm_aa_enable line_aa:2;
   bool high_quality_derivatives:1;
   bool force_dual_color_blend:1;
   bool coherent_fb_fetch:1;

   uint64_t input_slots_valid;
   unsigned program_string_id;
   GLenum alpha_test_func;          /**< For Gen4/5 MRT alpha test */
   float alpha_test_ref;

   struct brw_sampler_prog_key_data tex;
};

and it's the most complex of the shader stages. Wouldn't you feel a
lot safer if we just had a piglit test for each of those knobs that
compiled a shader, then changed some non-orthogonal state, and then
rendered with it, thus confirming that it didn't get the wrong shader
program out of the cache?
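
To make that concrete, here's a rough, hypothetical sketch of what one
of those tests could look like, picking the flat_shade bit as the knob.
Nothing like it exists in piglit today, and the colors and probe
positions are just illustrative; it leans on the compat gl_Color path
because that's what glShadeModel() feeds into the key:

/*
 * Hypothetical sketch, not an actual piglit test: link one GLSL
 * program, draw, then flip a single piece of non-orthogonal state
 * (glShadeModel(), which only shows up in the key as flat_shade) and
 * draw again.  If the cache lookup ignores that key bit and hands
 * back the first variant, the second probe fails.
 */
#include "piglit-util-gl.h"

PIGLIT_GL_TEST_CONFIG_BEGIN
    config.supports_gl_compat_version = 20;
    config.window_visual = PIGLIT_GL_VISUAL_RGBA | PIGLIT_GL_VISUAL_DOUBLE;
PIGLIT_GL_TEST_CONFIG_END

static const char *vs_source =
    "void main() {\n"
    "    gl_Position = gl_Vertex;\n"
    "    gl_FrontColor = gl_Color;\n"
    "}\n";

static const char *fs_source =
    "void main() {\n"
    "    gl_FragColor = gl_Color;\n"
    "}\n";

static void
draw_tri(void)
{
    glBegin(GL_TRIANGLES);
    glColor3f(1, 0, 0); glVertex2f(-1, -1);
    glColor3f(0, 1, 0); glVertex2f( 1, -1);
    glColor3f(0, 0, 1); glVertex2f( 0,  1); /* provoking vertex */
    glEnd();
}

void
piglit_init(int argc, char **argv)
{
    /* Loose tolerance: we only care which vertex color we got. */
    piglit_set_tolerance_for_bits(5, 5, 5, 5);
}

enum piglit_result
piglit_display(void)
{
    static const float red[]  = { 1, 0, 0 };
    static const float blue[] = { 0, 0, 1 };
    GLuint prog = piglit_build_simple_program(vs_source, fs_source);
    bool pass = true;

    glUseProgram(prog);

    /* First draw populates the cache with the smooth-shaded variant:
     * near the bottom-left vertex the color is ~red. */
    glShadeModel(GL_SMOOTH);
    glClear(GL_COLOR_BUFFER_BIT);
    draw_tri();
    pass = piglit_probe_pixel_rgb(2, 2, red) && pass;

    /* Same program; only brw_wm_prog_key::flat_shade differs now.
     * Flat shading takes the provoking (last) vertex, so the whole
     * triangle must be blue; getting red here means the wrong
     * variant came out of the cache. */
    glShadeModel(GL_FLAT);
    glClear(GL_COLOR_BUFFER_BIT);
    draw_tri();
    pass = piglit_probe_pixel_rgb(2, 2, blue) && pass;

    piglit_present_results();
    return pass ? PIGLIT_PASS : PIGLIT_FAIL;
}

A handful of tests along those lines, one per key field, would run in
milliseconds rather than costing another full piglit pass.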

I know I've run into numerous cases where piglit doesn't actually
test something, or where only one seemingly unrelated test out of
all 50 thousand tickles a real bug in my code. I feel uncomfortable
assuming that piglit's existing coverage is good enough.

In general, I'm still bothered by the idea of just running piglit
twice. That's a very untargeted approach that doesn't fit well with
our existing testing model, as Mark has described.

