[Mesa-dev] [PATCH] i965: Enable disk shader cache by default

Wed Dec 6 01:04:34 UTC 2017

Jordan Justen <jordan.l.justen at intel.com> writes:

> On 2017-12-05 14:49:28, Mark Janes wrote:
>> Jordan Justen <jordan.l.justen at intel.com> writes:
>> 
>> > On 2017-12-05 09:13:11, Mark Janes wrote:
>> >> Jordan Justen <jordan.l.justen at intel.com> writes:
>> >> 
>> >> > On 2017-11-08 17:26:47, Timothy Arceri wrote:
>> >> >> Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com>
>> >> >> 
>> >> >> Mark may want to consider adding some of the once a day type CI runs for 
>> >> >> this. For example running the test suite for two consecutive runs on the 
>> >> >> same build so that the second run uses the shader cache and also a 
>> >> >> second run the uses MESA_GLSL=cache_fb to force testing of the cache 
>> >> >> fallback path.
>> >> >
>> >> > Yeah. We discussed this previously, but I don't think it's been
>> >> > implemented yet. My opinion is that it could perhaps be a weekly test.
>> >> 
>> >> This automation is implemented now. It found the issues reported in
>> >> https://bugs.freedesktop.org/show_bug.cgi?id=103988
>> >> 
>> >> My opinion is that this test strategy is inadequate.
>> >
>> > Meaning you think we cannot enable i965 shader cache for the 18.0
>> > release?
>> 
>> I haven't heard anyone express confidence that this feature is bug-free,
>> and I don't know on what basis that claim could be made.  I appreciate
>> that a lot of have manual testing has been done.  Do you feel confident
>> that the cache can be enabled for all mesa customers without disruption?
>
> If we had enabled it two months ago, then yes, we would have worked
> through the niche issues, or discovered the so-far-hidden major land
> mine. At this point, it's getting risky. In a month, I will say no.
>
>> > We are already ~1.5 months away from the next stable branch point. If
>> > we want to enable this in 18.0, we should be using this time to see if
>> > enabling the cache by default has some large unexpected side effect in
>> > our graphics stack, or breaks real-world apps.
>> 
>> I agree.  We should encourage as many users as possible to enable the
>> shader cache in their environments.  At least one stable release should
>> be out with a working cache, where the feature can be enabled by those
>> who are eager to benefit from it.  I assume there will be a lot of them,
>> and they could flush out issues for everyone who has no idea what a
>> shader cache is.
>> 
>> >> Adding a dimension to the test matrix has high cost, especially when
>> >> combined with other dimensions of the test matrix (does shader cache
>> >> need to be tested for 32/64 bit builds? For hswgt1/hswgt2/hswgt3e?).
>> >
>> > Are you saying this is too high cost to run per check-in? Do you need
>> > to disable it for the health of CI? I think I proposed that daily, or
>> > perhaps even weekly would be adequate.
>> 
>> Certainly, the test time per line of shader cache code is massively
>> higher than any other feature, even if you run it just once a month.
>> Other features have tests that run in milliseconds, not 30min * 20
>> machines.
>
> The scope of this feature is nearly the entire API. It is justified to
> throw the entire GL suite of tests at it on a regular basis. The cost
> of running this once per week ought to be reasonable.
>
> Is the cost of running this per check-in too high for the system
> today? Or, is it that you think it is too high for the feature? I'm
> sympathetic to the former, and not so much to the later. :)
>
> By the way, enabling these in CI has been helpful in highlighting a
> few corner case issues. So, even if you don't like it, *Thank You* for
> enabling it. :)
>
>> Beyond poor return on execution time, there is the simple fact that
>> whoever is running the CI needs to manually look at shader cache results
>> separately from the rest of the tests.  Unit testing is effective
>> because coverage can be added at approximately zero marginal cost.
>> 
>> 3 years from now, will we still be looking at separate weekly shader
>> cache test runs to make sure it's working?
>
> I actually think that long term this should become part of the main
> daily and weekly tests. (An idea you've already rejected.) In the long
> term it should be stable enough for that, and if it does ever regress,
> we'd what to see something sooner rather than later.
>
>> > These tests are already identifying some corner case issues. I'm not
>> > sure these would have impacted real-world apps yet, but I do think it
>> > is a good idea to run these tests regularly.
>> >
>> > You say this test strategy is inadequate. Perhaps. I do think it needs
>> > to be part of our test strategy. There is no way we are going to hit
>> > all the corners of the API better than running all of our tests with
>> > the cache enabled. Do you agree?
>> 
>> Tests should strive to cover the implementation, not the API.
>
> I don't think any of our test suites operate this way. Very little of
> piglit or deqp focus on i965.
>
>> Shader
>> cache is a unique feature, because it affects a large part of the API.
>> It doesn't necessarily follow that covering the API will cover the
>> feature, or that that is an effective test strategy.
>
> As you say, the scope of the feature is nearly the entire API. You
> want us to start building a test suite that assumes the current
> implementation details of the i965 shader cache, and tests it?
>
> Regardless of the choice to focus only on i965, the scope will still
> be enormous. We will never feasibly reach the coverage level that we
> get from enabling the shader cache paths, and running our current
> hundreds of thousands of test cases on it.
>
>> > It could be interesting to define a MESA extension to control or query
>> > the shader cache. Today, the shader cache is a nearly 'invisible'
>> > feature. There are a few env-vars, but I wouldn't call a public
>> > interface.
>> >
>> > The downside of defining an extension is that it might constrain
>> > changes to the shader cache implementation. Or, if the interface is
>> > too complex, then it may be tough for some drivers to implement.
>> 
>> Why is an extension necessary?  By comparison, GCC has no interface to
>> query the statistics for ccache.  A utility that can read the cache
>> files and report hits/misses would at least inform us that the cache is
>> functioning.  The utility would be useful in writing real unit tests for
>> the feature.
>
> If we don't have an extension, then how would this work? i965 unit
> tests during the build? If we don't actually test the feature against
> real hardware, I think we'll miss a whole class of potential issues.

Unit tests are good, and I know we already have some of those.  We'll
have better data on the classes of issues are once we find more of
them.  So far, it's just a few hardware-independent piglit tests.

> If we don't have an extension, then how can an external (piglit) test
> hope to interact or probe the cache?

Piglit queries a utility or lib to make assertions about the cache?  For
that matter, this whole cache verification could be implemented within
piglit as a profile that executes tests twice with appropriate
verification.

Then, at least, developers could test their patches without a CI.
Piglit could even skip a bunch of tests which don't impact the cache.

> I'm not actually in favor of defining and implementing an extension,
> because I'm not sure about the ROI.
>
>> > But, what if we get around to defining and implementing this extenion,
>> > (and a few piglit tests that makes use of it), by mid-January? It'll
>> > still be pretty difficult to argue we should enable the shader cache
>> > on i965 for 18.0, since we burned all the time to let it be tested on
>> > master against real-world apps.
>> 
>> Personally, I think it is already difficult to argue that the cache
>> should be on by default in 18.0.
>
> Ok. I guess I'll check back with you sometime in 2018...
>
>> > Also, while you say our current test strategy is inadequate, I do
>> > think it is still far more comprehensive than anything we will
>> > accomplish by defining a new extension and writing a few, or even a
>> > few hundred tests against the new extension.
>> 
>> In the short term, we should do whatever is expedient to try to test the
>> feature as much as we can.
>> 
>> In the longer term, a weekly test is going to be costly and ineffective
>> at preventing regressions.  Tests need to be runnable for developers to
>> use them as part of their process.  If a test can only be run at great
>> expense in a weekly CI, then any fragility in the feature will be
>> discovered a week after a bug has been pushed to master.
>
> Given the scope of the feature, I just don't think it is going to be
> possible to get reasonable coverage on a developer's machine.
>
> Once an issue has been identified, it has generally been easy to
> reproduce locally.
>
> -Jordan