[Mesa-dev] [PATCH] i965: Enable disk shader cache by default

Wed Dec 6 01:21:15 UTC 2017

On 06/12/17 12:04, Mark Janes wrote:
> Jordan Justen <jordan.l.justen at intel.com> writes:
> 
>> On 2017-12-05 14:49:28, Mark Janes wrote:
>>> Jordan Justen <jordan.l.justen at intel.com> writes:
>>>
>>>> On 2017-12-05 09:13:11, Mark Janes wrote:
>>>>> Jordan Justen <jordan.l.justen at intel.com> writes:
>>>>>
>>>>>> On 2017-11-08 17:26:47, Timothy Arceri wrote:
>>>>>>> Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com>
>>>>>>>
>>>>>>> Mark may want to consider adding some of the once a day type CI runs for
>>>>>>> this. For example running the test suite for two consecutive runs on the
>>>>>>> same build so that the second run uses the shader cache and also a
>>>>>>> second run the uses MESA_GLSL=cache_fb to force testing of the cache
>>>>>>> fallback path.
>>>>>>
>>>>>> Yeah. We discussed this previously, but I don't think it's been
>>>>>> implemented yet. My opinion is that it could perhaps be a weekly test.
>>>>>
>>>>> This automation is implemented now. It found the issues reported in
>>>>> https://bugs.freedesktop.org/show_bug.cgi?id=103988
>>>>>
>>>>> My opinion is that this test strategy is inadequate.
>>>>
>>>> Meaning you think we cannot enable i965 shader cache for the 18.0
>>>> release?
>>>
>>> I haven't heard anyone express confidence that this feature is bug-free,
>>> and I don't know on what basis that claim could be made.  I appreciate
>>> that a lot of have manual testing has been done.  Do you feel confident
>>> that the cache can be enabled for all mesa customers without disruption?
>>
>> If we had enabled it two months ago, then yes, we would have worked
>> through the niche issues, or discovered the so-far-hidden major land
>> mine. At this point, it's getting risky. In a month, I will say no.
>>
>>>> We are already ~1.5 months away from the next stable branch point. If
>>>> we want to enable this in 18.0, we should be using this time to see if
>>>> enabling the cache by default has some large unexpected side effect in
>>>> our graphics stack, or breaks real-world apps.
>>>
>>> I agree.  We should encourage as many users as possible to enable the
>>> shader cache in their environments.  At least one stable release should
>>> be out with a working cache, where the feature can be enabled by those
>>> who are eager to benefit from it.  I assume there will be a lot of them,
>>> and they could flush out issues for everyone who has no idea what a
>>> shader cache is.
>>>
>>>>> Adding a dimension to the test matrix has high cost, especially when
>>>>> combined with other dimensions of the test matrix (does shader cache
>>>>> need to be tested for 32/64 bit builds? For hswgt1/hswgt2/hswgt3e?).
>>>>
>>>> Are you saying this is too high cost to run per check-in? Do you need
>>>> to disable it for the health of CI? I think I proposed that daily, or
>>>> perhaps even weekly would be adequate.
>>>
>>> Certainly, the test time per line of shader cache code is massively
>>> higher than any other feature, even if you run it just once a month.
>>> Other features have tests that run in milliseconds, not 30min * 20
>>> machines.
>>
>> The scope of this feature is nearly the entire API. It is justified to
>> throw the entire GL suite of tests at it on a regular basis. The cost
>> of running this once per week ought to be reasonable.
>>
>> Is the cost of running this per check-in too high for the system
>> today? Or, is it that you think it is too high for the feature? I'm
>> sympathetic to the former, and not so much to the later. :)
>>
>> By the way, enabling these in CI has been helpful in highlighting a
>> few corner case issues. So, even if you don't like it, *Thank You* for
>> enabling it. :)
>>
>>> Beyond poor return on execution time, there is the simple fact that
>>> whoever is running the CI needs to manually look at shader cache results
>>> separately from the rest of the tests.  Unit testing is effective
>>> because coverage can be added at approximately zero marginal cost.
>>>
>>> 3 years from now, will we still be looking at separate weekly shader
>>> cache test runs to make sure it's working?
>>
>> I actually think that long term this should become part of the main
>> daily and weekly tests. (An idea you've already rejected.) In the long
>> term it should be stable enough for that, and if it does ever regress,
>> we'd what to see something sooner rather than later.
>>
>>>> These tests are already identifying some corner case issues. I'm not
>>>> sure these would have impacted real-world apps yet, but I do think it
>>>> is a good idea to run these tests regularly.
>>>>
>>>> You say this test strategy is inadequate. Perhaps. I do think it needs
>>>> to be part of our test strategy. There is no way we are going to hit
>>>> all the corners of the API better than running all of our tests with
>>>> the cache enabled. Do you agree?
>>>
>>> Tests should strive to cover the implementation, not the API.
>>
>> I don't think any of our test suites operate this way. Very little of
>> piglit or deqp focus on i965.
>>
>>> Shader
>>> cache is a unique feature, because it affects a large part of the API.
>>> It doesn't necessarily follow that covering the API will cover the
>>> feature, or that that is an effective test strategy.
>>
>> As you say, the scope of the feature is nearly the entire API. You
>> want us to start building a test suite that assumes the current
>> implementation details of the i965 shader cache, and tests it?
>>
>> Regardless of the choice to focus only on i965, the scope will still
>> be enormous. We will never feasibly reach the coverage level that we
>> get from enabling the shader cache paths, and running our current
>> hundreds of thousands of test cases on it.
>>
>>>> It could be interesting to define a MESA extension to control or query
>>>> the shader cache. Today, the shader cache is a nearly 'invisible'
>>>> feature. There are a few env-vars, but I wouldn't call a public
>>>> interface.
>>>>
>>>> The downside of defining an extension is that it might constrain
>>>> changes to the shader cache implementation. Or, if the interface is
>>>> too complex, then it may be tough for some drivers to implement.
>>>
>>> Why is an extension necessary?  By comparison, GCC has no interface to
>>> query the statistics for ccache.  A utility that can read the cache
>>> files and report hits/misses would at least inform us that the cache is
>>> functioning.  The utility would be useful in writing real unit tests for
>>> the feature.
>>
>> If we don't have an extension, then how would this work? i965 unit
>> tests during the build? If we don't actually test the feature against
>> real hardware, I think we'll miss a whole class of potential issues.
> 
> Unit tests are good, and I know we already have some of those.  We'll
> have better data on the classes of issues are once we find more of
> them.  So far, it's just a few hardware-independent piglit tests.
> 
>> If we don't have an extension, then how can an external (piglit) test
>> hope to interact or probe the cache?
> 
> Piglit queries a utility or lib to make assertions about the cache?  For
> that matter, this whole cache verification could be implemented within
> piglit as a profile that executes tests twice with appropriate
> verification.

That would mean manual work to add tests to a profile. The manual part 
makes it unlikely to ever happen, also trying to manually update a list 
to test every variation of shader is doomed to fail. Alternatively 
adding only tests that have found a problem in the past makes it not 
very useful.

> 
> Then, at least, developers could test their patches without a CI.
> Piglit could even skip a bunch of tests which don't impact the cache.

They already can, they just need to run piglit twice. If thats not easy 
enough we can add a profile that causes everything to be run twice.

> 
>> I'm not actually in favor of defining and implementing an extension,
>> because I'm not sure about the ROI.
>>
>>>> But, what if we get around to defining and implementing this extenion,
>>>> (and a few piglit tests that makes use of it), by mid-January? It'll
>>>> still be pretty difficult to argue we should enable the shader cache
>>>> on i965 for 18.0, since we burned all the time to let it be tested on
>>>> master against real-world apps.
>>>
>>> Personally, I think it is already difficult to argue that the cache
>>> should be on by default in 18.0.
>>
>> Ok. I guess I'll check back with you sometime in 2018...
>>
>>>> Also, while you say our current test strategy is inadequate, I do
>>>> think it is still far more comprehensive than anything we will
>>>> accomplish by defining a new extension and writing a few, or even a
>>>> few hundred tests against the new extension.
>>>
>>> In the short term, we should do whatever is expedient to try to test the
>>> feature as much as we can.
>>>
>>> In the longer term, a weekly test is going to be costly and ineffective
>>> at preventing regressions.  Tests need to be runnable for developers to
>>> use them as part of their process.  If a test can only be run at great
>>> expense in a weekly CI, then any fragility in the feature will be
>>> discovered a week after a bug has been pushed to master.
>>
>> Given the scope of the feature, I just don't think it is going to be
>> possible to get reasonable coverage on a developer's machine.
>>
>> Once an issue has been identified, it has generally been easy to
>> reproduce locally.
>>
>> -Jordan
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>