[Intel-gfx] [RFC 4/4] drm/i915: Expose RPCS (SSEU) configuration to userspace

Tue May 2 15:00:27 UTC 2017

On 05/02/2017 07:55 PM, Chris Wilson wrote:
> On Tue, May 02, 2017 at 10:33:19AM +0000, Oscar Mateo wrote:
>>
>> On 05/02/2017 11:49 AM, Chris Wilson wrote:
>>> We want to allow userspace to reconfigure the subslice configuration for
>>> its own use case. To do so, we expose a context parameter to allow
>>> adjustment of the RPCS register stored within the context image (and
>>> currently not accessible via LRI).
>> Userspace could also do this by themselves via LRI if we simply
>> whitelist GEN8_R_PWR_CLK_STATE.
>>
>> Hardware people suggested this programming model:
>>
>> - PIPECONTROL - Stalling flush, flush all caches (color, depth, DC$)
>> - LOAD_REGISTER_IMMEDIATE - R_PWR_CLK_STATE
>> - Reprogram complete state
> Hmm, treating it as a complete state wipe is a nuisance, but fairly
> trivial. The simplest way will be for the user to execute the LRI batch
> as part of creating the context. But there will be some use cases where
> dynamic reconfiguration within an active context will be desired, I'm
> sure.

Exactly, in this way the UMD gets the best of both worlds: they can do 
the LRI once and forget about it, or they can reconfigure on-demand.
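
As a rough illustration of the sequence above, here is a minimal sketch of
the batch prologue a UMD might emit. It assumes GEN8_R_PWR_CLK_STATE has been
whitelisted (it is not today), and the opcode/bit values are copied from my
reading of i915_reg.h, so please double-check them. emit_rpcs_update() is a
hypothetical helper name, not an existing function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodings believed to match i915_reg.h; verify before relying on them. */
#define MI_INSTR(opcode, flags)   (((uint32_t)(opcode) << 23) | (flags))
#define MI_LOAD_REGISTER_IMM(n)   MI_INSTR(0x22, 2 * (n) - 1)
#define GFX_OP_PIPE_CONTROL(len)  ((0x3u << 29) | (0x3u << 27) | (0x2u << 24) | ((len) - 2))
#define PIPE_CONTROL_CS_STALL                   (1u << 20)
#define PIPE_CONTROL_RENDER_TARGET_CACHE_FLUSH  (1u << 12)
#define PIPE_CONTROL_DC_FLUSH_ENABLE            (1u << 5)
#define PIPE_CONTROL_DEPTH_CACHE_FLUSH          (1u << 0)
#define GEN8_R_PWR_CLK_STATE                    0x20c8u

/*
 * Hypothetical helper: write the stalling flush followed by the LRI into
 * cs[] and return the number of dwords written. The caller is expected to
 * re-emit its complete state right after this prologue.
 */
static size_t emit_rpcs_update(uint32_t *cs, size_t max_dwords, uint32_t rpcs)
{
	size_t n = 0;

	assert(max_dwords >= 9);

	/* 1. Stalling flush of the color, depth and DC caches (6 dwords on gen8). */
	cs[n++] = GFX_OP_PIPE_CONTROL(6);
	cs[n++] = PIPE_CONTROL_CS_STALL |
		  PIPE_CONTROL_RENDER_TARGET_CACHE_FLUSH |
		  PIPE_CONTROL_DEPTH_CACHE_FLUSH |
		  PIPE_CONTROL_DC_FLUSH_ENABLE;
	cs[n++] = 0; /* post-sync address, low (unused) */
	cs[n++] = 0; /* post-sync address, high (unused) */
	cs[n++] = 0; /* immediate data, low (unused) */
	cs[n++] = 0; /* immediate data, high (unused) */

	/* 2. Load the new slice/subslice/EU configuration. */
	cs[n++] = MI_LOAD_REGISTER_IMM(1);
	cs[n++] = GEN8_R_PWR_CLK_STATE;
	cs[n++] = rpcs;

	/* 3. "Reprogram complete state" is left to the caller. */
	return n;
}
```

The same prologue works for both cases: run it once in the first batch of a
new context, or emit it again whenever the UMD wants to reconfigure.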

>>> If the context is adjusted before
>>> first use, the adjustment is for "free"; otherwise if the context is
>>> active we flush the context off the GPU (stalling all users), forcing
>>> the GPU to save the context to memory where we can modify it and so
>>> ensure that the register is reloaded on next execution.
>> There is another cost associated with the adjustment: slice poweron
>> and shutdown do take some time to happen (in the order of tens of
>> usecs). I have been playing with an i-g-t benchmark to measure this
>> delay, I'll send it to the mailing list.
> Hmm, I thought the argument for why selecting smaller subslices gave
> better performance was that it was restoring the whole set between
> contexts, even when the configuration between contexts was the same.

Hmmm... this is the first time I've heard that particular argument. I can 
definitely see the delay when changing the configuration (also, powering 
slices on takes a little longer than powering them down) but no 
difference when I am just switching between contexts with the same 
configuration.
Until now, the most convincing argument I've heard is that thread 
scheduling is much more efficient with just one slice when you don't 
really need more, but maybe that doesn't explain the whole picture.

> As always numbers demonstrating the advantage, perhaps explaining why
> it helps, and also for spotting when we break it are most welcome :)
> -Chris

I can provide numbers for the slice configuration delay (numbers that 
have to be taken into account by the UMD when deciding which 
configuration to use) but I think Dimitry is in a better position to 
provide numbers for the advantage.