[Mesa-dev] [PATCH] i965: Enable disk shader cache by default

Wed Dec 6 02:30:30 UTC 2017

Timothy Arceri <tarceri at itsqueeze.com> writes:

> On 06/12/17 12:04, Mark Janes wrote:
>> Jordan Justen <jordan.l.justen at intel.com> writes:
>> 
>>> On 2017-12-05 14:49:28, Mark Janes wrote:
>>>> Jordan Justen <jordan.l.justen at intel.com> writes:
>>>>
>>>>> On 2017-12-05 09:13:11, Mark Janes wrote:
>>>>>> Jordan Justen <jordan.l.justen at intel.com> writes:
>>>>>>
>>>>>>> On 2017-11-08 17:26:47, Timothy Arceri wrote:
>>>>>>>> Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com>
>>>>>>>>
>>>>>>>> Mark may want to consider adding some of the once a day type CI runs for
>>>>>>>> this. For example running the test suite for two consecutive runs on the
>>>>>>>> same build so that the second run uses the shader cache and also a
>>>>>>>> second run the uses MESA_GLSL=cache_fb to force testing of the cache
>>>>>>>> fallback path.
>>>>>>>
>>>>>>> Yeah. We discussed this previously, but I don't think it's been
>>>>>>> implemented yet. My opinion is that it could perhaps be a weekly test.
>>>>>>
>>>>>> This automation is implemented now. It found the issues reported in
>>>>>> https://bugs.freedesktop.org/show_bug.cgi?id=103988
>>>>>>
>>>>>> My opinion is that this test strategy is inadequate.
>>>>>
>>>>> Meaning you think we cannot enable i965 shader cache for the 18.0
>>>>> release?
>>>>
>>>> I haven't heard anyone express confidence that this feature is bug-free,
>>>> and I don't know on what basis that claim could be made.  I appreciate
>>>> that a lot of have manual testing has been done.  Do you feel confident
>>>> that the cache can be enabled for all mesa customers without disruption?
>>>
>>> If we had enabled it two months ago, then yes, we would have worked
>>> through the niche issues, or discovered the so-far-hidden major land
>>> mine. At this point, it's getting risky. In a month, I will say no.
>>>
>>>>> We are already ~1.5 months away from the next stable branch point. If
>>>>> we want to enable this in 18.0, we should be using this time to see if
>>>>> enabling the cache by default has some large unexpected side effect in
>>>>> our graphics stack, or breaks real-world apps.
>>>>
>>>> I agree.  We should encourage as many users as possible to enable the
>>>> shader cache in their environments.  At least one stable release should
>>>> be out with a working cache, where the feature can be enabled by those
>>>> who are eager to benefit from it.  I assume there will be a lot of them,
>>>> and they could flush out issues for everyone who has no idea what a
>>>> shader cache is.
>>>>
>>>>>> Adding a dimension to the test matrix has high cost, especially when
>>>>>> combined with other dimensions of the test matrix (does shader cache
>>>>>> need to be tested for 32/64 bit builds? For hswgt1/hswgt2/hswgt3e?).
>>>>>
>>>>> Are you saying this is too high cost to run per check-in? Do you need
>>>>> to disable it for the health of CI? I think I proposed that daily, or
>>>>> perhaps even weekly would be adequate.
>>>>
>>>> Certainly, the test time per line of shader cache code is massively
>>>> higher than any other feature, even if you run it just once a month.
>>>> Other features have tests that run in milliseconds, not 30min * 20
>>>> machines.
>>>
>>> The scope of this feature is nearly the entire API. It is justified to
>>> throw the entire GL suite of tests at it on a regular basis. The cost
>>> of running this once per week ought to be reasonable.
>>>
>>> Is the cost of running this per check-in too high for the system
>>> today? Or, is it that you think it is too high for the feature? I'm
>>> sympathetic to the former, and not so much to the later. :)
>>>
>>> By the way, enabling these in CI has been helpful in highlighting a
>>> few corner case issues. So, even if you don't like it, *Thank You* for
>>> enabling it. :)
>>>
>>>> Beyond poor return on execution time, there is the simple fact that
>>>> whoever is running the CI needs to manually look at shader cache results
>>>> separately from the rest of the tests.  Unit testing is effective
>>>> because coverage can be added at approximately zero marginal cost.
>>>>
>>>> 3 years from now, will we still be looking at separate weekly shader
>>>> cache test runs to make sure it's working?
>>>
>>> I actually think that long term this should become part of the main
>>> daily and weekly tests. (An idea you've already rejected.) In the long
>>> term it should be stable enough for that, and if it does ever regress,
>>> we'd what to see something sooner rather than later.
>>>
>>>>> These tests are already identifying some corner case issues. I'm not
>>>>> sure these would have impacted real-world apps yet, but I do think it
>>>>> is a good idea to run these tests regularly.
>>>>>
>>>>> You say this test strategy is inadequate. Perhaps. I do think it needs
>>>>> to be part of our test strategy. There is no way we are going to hit
>>>>> all the corners of the API better than running all of our tests with
>>>>> the cache enabled. Do you agree?
>>>>
>>>> Tests should strive to cover the implementation, not the API.
>>>
>>> I don't think any of our test suites operate this way. Very little of
>>> piglit or deqp focus on i965.
>>>
>>>> Shader
>>>> cache is a unique feature, because it affects a large part of the API.
>>>> It doesn't necessarily follow that covering the API will cover the
>>>> feature, or that that is an effective test strategy.
>>>
>>> As you say, the scope of the feature is nearly the entire API. You
>>> want us to start building a test suite that assumes the current
>>> implementation details of the i965 shader cache, and tests it?
>>>
>>> Regardless of the choice to focus only on i965, the scope will still
>>> be enormous. We will never feasibly reach the coverage level that we
>>> get from enabling the shader cache paths, and running our current
>>> hundreds of thousands of test cases on it.
>>>
>>>>> It could be interesting to define a MESA extension to control or query
>>>>> the shader cache. Today, the shader cache is a nearly 'invisible'
>>>>> feature. There are a few env-vars, but I wouldn't call a public
>>>>> interface.
>>>>>
>>>>> The downside of defining an extension is that it might constrain
>>>>> changes to the shader cache implementation. Or, if the interface is
>>>>> too complex, then it may be tough for some drivers to implement.
>>>>
>>>> Why is an extension necessary?  By comparison, GCC has no interface to
>>>> query the statistics for ccache.  A utility that can read the cache
>>>> files and report hits/misses would at least inform us that the cache is
>>>> functioning.  The utility would be useful in writing real unit tests for
>>>> the feature.
>>>
>>> If we don't have an extension, then how would this work? i965 unit
>>> tests during the build? If we don't actually test the feature against
>>> real hardware, I think we'll miss a whole class of potential issues.
>> 
>> Unit tests are good, and I know we already have some of those.  We'll
>> have better data on the classes of issues are once we find more of
>> them.  So far, it's just a few hardware-independent piglit tests.
>> 
>>> If we don't have an extension, then how can an external (piglit) test
>>> hope to interact or probe the cache?
>> 
>> Piglit queries a utility or lib to make assertions about the cache?  For
>> that matter, this whole cache verification could be implemented within
>> piglit as a profile that executes tests twice with appropriate
>> verification.
>
> That would mean manual work to add tests to a profile. The manual part 
> makes it unlikely to ever happen, also trying to manually update a list 
> to test every variation of shader is doomed to fail. Alternatively 
> adding only tests that have found a problem in the past makes it not 
> very useful.

Are there not some cpu tests that can be easily dropped from the
profile?

>> 
>> Then, at least, developers could test their patches without a CI.
>> Piglit could even skip a bunch of tests which don't impact the cache.
>
> They already can, they just need to run piglit twice. If thats not easy 
> enough we can add a profile that causes everything to be run twice.
>
>> 
>>> I'm not actually in favor of defining and implementing an extension,
>>> because I'm not sure about the ROI.
>>>
>>>>> But, what if we get around to defining and implementing this extenion,
>>>>> (and a few piglit tests that makes use of it), by mid-January? It'll
>>>>> still be pretty difficult to argue we should enable the shader cache
>>>>> on i965 for 18.0, since we burned all the time to let it be tested on
>>>>> master against real-world apps.
>>>>
>>>> Personally, I think it is already difficult to argue that the cache
>>>> should be on by default in 18.0.
>>>
>>> Ok. I guess I'll check back with you sometime in 2018...
>>>
>>>>> Also, while you say our current test strategy is inadequate, I do
>>>>> think it is still far more comprehensive than anything we will
>>>>> accomplish by defining a new extension and writing a few, or even a
>>>>> few hundred tests against the new extension.
>>>>
>>>> In the short term, we should do whatever is expedient to try to test the
>>>> feature as much as we can.
>>>>
>>>> In the longer term, a weekly test is going to be costly and ineffective
>>>> at preventing regressions.  Tests need to be runnable for developers to
>>>> use them as part of their process.  If a test can only be run at great
>>>> expense in a weekly CI, then any fragility in the feature will be
>>>> discovered a week after a bug has been pushed to master.
>>>
>>> Given the scope of the feature, I just don't think it is going to be
>>> possible to get reasonable coverage on a developer's machine.
>>>
>>> Once an issue has been identified, it has generally been easy to
>>> reproduce locally.
>>>
>>> -Jordan
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>