[Mesa-dev] [PATCH] i965: Enable disk shader cache by default

Thu Dec 7 23:38:35 UTC 2017

On 08/12/17 04:57, Matt Turner wrote:
> On Tue, Dec 5, 2017 at 3:40 PM, Jordan Justen <jordan.l.justen at intel.com> wrote:
>> On 2017-12-05 14:49:28, Mark Janes wrote:
>>> Jordan Justen <jordan.l.justen at intel.com> writes:
>>>
>>>> On 2017-12-05 09:13:11, Mark Janes wrote:
>>>>> Jordan Justen <jordan.l.justen at intel.com> writes:
>>>>>
>>>>>> On 2017-11-08 17:26:47, Timothy Arceri wrote:
>>>>>>> Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com>
>>>>>>>
>>>>>>> Mark may want to consider adding some of the once a day type CI runs for
>>>>>>> this. For example running the test suite for two consecutive runs on the
>>>>>>> same build so that the second run uses the shader cache and also a
>>>>>>> second run the uses MESA_GLSL=cache_fb to force testing of the cache
>>>>>>> fallback path.
>>>>>>
>>>>>> Yeah. We discussed this previously, but I don't think it's been
>>>>>> implemented yet. My opinion is that it could perhaps be a weekly test.
>>>>>
>>>>> This automation is implemented now. It found the issues reported in
>>>>> https://bugs.freedesktop.org/show_bug.cgi?id=103988
>>>>>
>>>>> My opinion is that this test strategy is inadequate.
>>>>
>>>> Meaning you think we cannot enable i965 shader cache for the 18.0
>>>> release?
>>>
>>> I haven't heard anyone express confidence that this feature is bug-free,
>>> and I don't know on what basis that claim could be made.  I appreciate
>>> that a lot of have manual testing has been done.  Do you feel confident
>>> that the cache can be enabled for all mesa customers without disruption?
>>
>> If we had enabled it two months ago, then yes, we would have worked
>> through the niche issues, or discovered the so-far-hidden major land
>> mine. At this point, it's getting risky. In a month, I will say no.
>>
>>>> We are already ~1.5 months away from the next stable branch point. If
>>>> we want to enable this in 18.0, we should be using this time to see if
>>>> enabling the cache by default has some large unexpected side effect in
>>>> our graphics stack, or breaks real-world apps.
>>>
>>> I agree.  We should encourage as many users as possible to enable the
>>> shader cache in their environments.  At least one stable release should
>>> be out with a working cache, where the feature can be enabled by those
>>> who are eager to benefit from it.  I assume there will be a lot of them,
>>> and they could flush out issues for everyone who has no idea what a
>>> shader cache is.
>>>
>>>>> Adding a dimension to the test matrix has high cost, especially when
>>>>> combined with other dimensions of the test matrix (does shader cache
>>>>> need to be tested for 32/64 bit builds? For hswgt1/hswgt2/hswgt3e?).
>>>>
>>>> Are you saying this is too high cost to run per check-in? Do you need
>>>> to disable it for the health of CI? I think I proposed that daily, or
>>>> perhaps even weekly would be adequate.
>>>
>>> Certainly, the test time per line of shader cache code is massively
>>> higher than any other feature, even if you run it just once a month.
>>> Other features have tests that run in milliseconds, not 30min * 20
>>> machines.
>>
>> The scope of this feature is nearly the entire API. It is justified to
>> throw the entire GL suite of tests at it on a regular basis. The cost
>> of running this once per week ought to be reasonable.
> 
> But the entire API boils down to a comparatively small set of
> non-orthogonal state. The configuration of those nobs seems to me like
> the place things are most likely to break.
> 
> I think there's value in testing that we're hitting the cache, but if
> we're not it's not a functional regression. I'm more concerned about
> ensuring we don't have bugs that affect functionality and cause things
> to break.
> 
> The program key for fragment shaders looks like:
> 
> /** The program key for Fragment/Pixel Shaders. */
> struct brw_wm_prog_key {
>     /* Some collection of BRW_WM_IZ_* */
>     uint8_t iz_lookup;
>     bool stats_wm:1;
>     bool flat_shade:1;
>     unsigned nr_color_regions:5;
>     bool replicate_alpha:1;
>     bool clamp_fragment_color:1;
>     bool persample_interp:1;
>     bool multisample_fbo:1;
>     bool frag_coord_adds_sample_pos:1;
>     enum brw_wm_aa_enable line_aa:2;
>     bool high_quality_derivatives:1;
>     bool force_dual_color_blend:1;
>     bool coherent_fb_fetch:1;
> 
>     uint64_t input_slots_valid;
>     unsigned program_string_id;
>     GLenum alpha_test_func;          /* < For Gen4/5 MRT alpha test */
>     float alpha_test_ref;
> 
>     struct brw_sampler_prog_key_data tex;
> };
> 
> and it's the most complex of the shader stages. Wouldn't you feel a
> lot safer if we just had a piglit test for each of those nobs that
> compiled a shader, then changed some non-orthogonal state, and then
> rendered with it, thus confirming that it didn't get the wrong shader
> program out of the cache?

Those are just the keys for i965s gen program. You also should be 
testing the serialization and deserialization of the GL state and the IR 
(in i965s case NIR). For NIR you could turn on an ENV var to do a 
serialization/deserialization for every shader, I think something like 
that is done for nir clone.

> 
> I know I've run into cases numerous times where piglit doesn't
> actually test something, or only one seemingly unrelated test in all
> 50 thousand tickles a real bug in my code. I feel uncomfortable
> assuming that piglit's existing coverage is good enough.

I don't see how this is related to the discussion, isn't this an 
existing problem? These keys are already in use for the in-memory cache.

> 
> In general, I'm still bothered by the idea of just running piglit
> twice. That's a very untargeted approach that doesn't fit well with
> our existing testing model, as Mark has described.

IMO the problem is nobody is going to be writing tests that target the 
cache, running things twice means we automatically get to make use of 
any new tests added for any new feature, without this I just see things 
breaking until someone notices it and then the bug is likely to be fix 
without a specific shader cache test being written. You say above "I 
know I've run into cases numerous times where piglit doesn't actually 
test something", while running piglit twice may be an untargeted it 
simplicity is its appeal in a test system that is already under 
resourced and has coverage gaps.