[Intel-gfx] [PATCH] drm/i915: Allow null render state batchbuffers bigger than one page

Thu Oct 12 22:31:32 UTC 2017

On Tue, Oct 10, 2017 at 10:29:41AM +0000, Chris Wilson wrote:
> Quoting Chris Wilson (2017-10-10 11:25:38)
> > Quoting Rodrigo Vivi (2017-10-05 05:34:02)
> > > On Thu, Aug 24, 2017 at 11:00:27PM +0000, Rodrigo Vivi wrote:
> > > > On Thu, Aug 24, 2017 at 3:39 PM, Oscar Mateo <oscar.mateo at intel.com> wrote:
> > > > >
> > > > >
> > > > > On 08/23/2017 05:01 PM, Rodrigo Vivi wrote:
> > > > >>
> > > > >> On Tue, Jul 18, 2017 at 8:15 AM, Oscar Mateo <oscar.mateo at intel.com>
> > > > >> wrote:
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> On 07/14/2017 08:08 AM, Chris Wilson wrote:
> > > > >>>>
> > > > >>>> Quoting Oscar Mateo (2017-07-14 15:52:59)
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On 07/13/2017 03:28 PM, Rodrigo Vivi wrote:
> > > > >>>>>>
> > > > >>>>>> On Wed, May 3, 2017 at 9:31 AM, Chris Wilson
> > > > >>>>>> <chris at chris-wilson.co.uk>
> > > > >>>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>> On Wed, May 03, 2017 at 09:12:18AM +0000, Oscar Mateo wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>       On 05/03/2017 08:52 AM, Mika Kuoppala wrote:
> > > > >>>>>>>>
> > > > > >>>>>>>>     Oscar Mateo <oscar.mateo at intel.com> writes:
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>     On 05/02/2017 09:17 AM, Mika Kuoppala wrote:
> > > > >>>>>>>>
> > > > > >>>>>>>>     Chris Wilson <chris at chris-wilson.co.uk> writes:
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>     On Fri, Apr 28, 2017 at 09:11:06AM +0000, Oscar Mateo wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>     The new batchbuffer for CNL surpasses the 4096 byte mark.
> > > > >>>>>>>>
> > > > > >>>>>>>>     Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> > > > > >>>>>>>>     Cc: Ben Widawsky <ben at bwidawsk.net>
> > > > > >>>>>>>>     Signed-off-by: Oscar Mateo <oscar.mateo at intel.com>
> > > > >>>>>>>>
> > > > > >>>>>>>>     Evil, 4k+ of nothing-ness that userspace then has to configure
> > > > > >>>>>>>>     for itself for correctness anyway.
> > > > >>>>>>>>
> > > > >>>>>>>>     Patch looks ok, but still question the sanity.
> > > > >>>>>>>>
> > > > >>>>>>>>     Is there a requirement for CNL to init the renderstate?
> > > > >>>>>>>>
> > > > >>>>>>>>     I would like to drop the render state init from CNL if
> > > > >>>>>>>>     we can't find evidence that it needs it. Bspec indicates
> > > > > >>>>>>>>     that it doesn't.
> > > > >>>>>>
> > > > > >>>>>> I'd like to drop it as well, and I was hearing people around saying we
> > > > > >>>>>> didn't need it anymore;
> > > > > >>>>>> however, without this I had bad failures during power-on...
> > > > >>>>>>
> > > > >>>>> The best I could get from architecture (+Raf) is that setting valid and
> > > > >>>>> coherent values for the whole render state is required as soon as the
> > > > >>>>> context is created, no matter who does it. If you see failures when the
> > > > >>>>> KMD does not do it, that means the UMD must be missing something,
> > > > >>>>> right?
> > > > >>>>
> > > > >>>> That is my initial response as well. The kernel does load one context,
> > > > >>>> just so that the hardware always has space to write to on power saving.
> > > > >>>> The only batch executed for it is the golden render state. Easy enough
> > > > >>>> to only initialise that kernel context to isolate whether it is
> > > > >>>> self-inflicted or that userspace overlooked something in its state
> > > > >>>> management. (I have the view that even if userspace doesn't think it
> > > > >>>> needs to use a particular bit of state today, tomorrow it will so will
> > > > >>>> need it anyway!)
> > > > >>>> -Chris
> > > > >>>
> > > > >>>
> > > > >>> Rodrigo, you have access to a CNL: can you make this test? The idea is to
> > > > >>> find out if the root cause for the failures you were seeing is the kernel
> > > > >>> default context or in the UMD-created contexts.
> > > > >>
> > > > >> I'm sorry for the delay on this one.
> > > > >>
> > > > > >> On the parts I have now I couldn't reproduce the issues I saw during
> > > > > >> power-on, where the null context helped.
> > > > > >>
> > > > > >> But anyway, apparently we need this, right?!
> > > > >>
> > > > >> What about the 4k+ sanity that Chris raised? Anything we should address
> > > > >> first?
> > > > >
> > > > >
> > > > > I don't think Chris had any problem with the batchbuffer being bigger than
> > > > > 4k per se. His concern was: "why do we need to send this batchbuffer from
> > > > > the KMD at all if the UMD has to send something very similar anyway?".
> > > > > Even if this was true (I haven't found anybody to confirm or deny it) there
> > > > > is still the question of the kernel context (which would never get
> > > > > initialized to valid values by the UMD).
> > > > 
> > > > so, chris, rv-b? acked-by?
> > > 
> > > chris, mika, oscar...
> > > what should we do with this?
> > > just discard, ignore and move on without the null context for gen10+?
> > 
> > If there's no requirement for us to have it, then let's break the cargo
> > cult. Certainly userspace does not expect 3DSTATE to have any default
> > value, unlike the defaults specified for mmio state (which is currently
> > causing a huge upset). It's only if the bspec has wording that makes
> > certain valid 3DSTATE (or GPGPU or MEDIA) mandatory for powercontext etc
> > do we have to worry.
> 
> The other angle is that the proto context is entirely defined by us. New
> userspace contexts should not see any state that is outside of the
> context construction (either directly specified inside the image or
> implicitly from priv registers). In essence for lrc, we already define
> the golden render state but call it a context image instead.
> -Chris

So, are you saying there is absolutely no risk of one userspace component
leaving garbage in any of these registers, and another component assuming it
is null or valid, doing some RMW, and ending up with the wrong setup?

I believe in the past there were cases like this between Mesa and Libva.

And if issues like this start to appear again, then apparently
debugging is harder because the garbage left behind would be random.
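
To make the RMW concern concrete, here is a minimal sketch in C. It is not
i915 code: the state word, the bit layout, and the names MODE_MASK,
MODE_FAST, rmw_set_mode() and full_set_mode() are all assumptions made up
for illustration.

```c
#include <stdint.h>

/* Hypothetical state word; this bit layout is an assumption for
 * illustration, not a real i915 or hardware register. */
#define MODE_MASK 0x0000000fu /* the only field this component owns */
#define MODE_FAST 0x00000003u

/* Read-modify-write: updates the owned field and trusts every other
 * bit of the inherited state to already hold a sane value. */
static uint32_t rmw_set_mode(uint32_t state, uint32_t mode)
{
    return (state & ~MODE_MASK) | (mode & MODE_MASK);
}

/* Full write: the result does not depend on inherited state at all. */
static uint32_t full_set_mode(uint32_t mode)
{
    return mode & MODE_MASK;
}
```

Starting from a zeroed state, rmw_set_mode(0, MODE_FAST) gives 0x3. Starting
from a leftover value such as 0xa07, the same call gives 0xa03, so the stale
upper bits survive, while full_set_mode(MODE_FAST) gives 0x3 either way.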

I understand the cargo cult part, but with many different userspaces out there
using the GPU, the cost of the kernel ensuring the null state is really low
compared with the stability it can bring without relying on userspace.

I understand your point about breaking the cargo cult. But my doubt is: if we
stop this after many years of clearing the state, do we expect every userspace
to go ahead and modify its code to not make any assumptions about those states
on CNL+?

Thanks,
Rodrigo.