[Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Tue May 4 10:53:39 UTC 2021

On 04.05.21 at 11:47, Daniel Vetter wrote:
> [SNIP]
>> Yeah, it just takes too long for the preemption to complete to be really
>> useful for the feature we are discussing here.
>>
>> As I said when the kernel requests to preempt a queue we can easily expect a
>> timeout of ~100ms until that comes back. For compute that is even in the
>> multiple seconds range.
> 100ms for preempting an idle request sounds like broken hw to me. Of
> course preempting something that actually runs takes a while, that's
> nothing new. But it's also not the thing we're talking about here. Is this
> 100ms an actual number from hw for an actual idle ringbuffer?

Well 100ms is just an example of the scheduler granularity. Let me 
explain in a wider context.

The hardware can have X queues mapped at the same time. Every Y time 
interval the hardware scheduler checks whether those queues have changed, 
and only if they have does it start the steps necessary to reload them.

Multiple queues can be rendering at the same time. So you can have the X 
server as an active high-priority queue just waiting for a signal to 
start, the client rendering one frame after another, and a third 
background compute task mining bitcoins for you.

As long as everything is static this is perfectly performant. Adding a 
queue to the list of active queues is also relatively simple, but taking 
one down requires you to wait until we are sure the hardware has seen 
the change and reloaded the queues.

Think of it as an RCU grace period. This is simply not something which 
is made to be used constantly, but rather just at process termination.
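To illustrate the asymmetry, here is a minimal toy model, not how the real firmware works. The class name, the method names and the 100ms tick are all made up for the sketch. The only thing it takes from the description above is that the hardware re-reads the queue list once per interval, so adding returns immediately while removing has to wait for the next re-read.

```python
from dataclasses import dataclass, field


@dataclass
class HwScheduler:
    """Toy model: the hardware latches the queue list only once per tick."""

    tick_ms: int = 100                              # assumed granularity ("Y")
    requested: set = field(default_factory=set)     # what the kernel wants mapped
    loaded: set = field(default_factory=set)        # what the hardware last latched
    now_ms: int = 0

    def add_queue(self, q):
        # Adding is cheap: update the list and return immediately.
        self.requested.add(q)

    def advance(self, ms):
        # Move time forward; at every tick boundary the hardware compares
        # the lists and reloads only if something changed.
        end = self.now_ms + ms
        next_tick = (self.now_ms // self.tick_ms + 1) * self.tick_ms
        while next_tick <= end:
            if self.loaded != self.requested:
                self.loaded = set(self.requested)
            next_tick += self.tick_ms
        self.now_ms = end

    def remove_queue(self, q):
        # Removing must block until the next tick, because only then can we
        # be sure the hardware has seen the change (the "grace period").
        self.requested.discard(q)
        start = self.now_ms
        self.advance(self.tick_ms - (self.now_ms % self.tick_ms))
        assert q not in self.loaded
        return self.now_ms - start                  # time spent waiting, in ms
```

In this model every removal costs up to one full tick, which is tolerable at process termination but not on every submission.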

>> The "preemption" feature is really called suspend and made just for the case
>> when we want to put a process to sleep or need to forcefully kill it for
>> misbehavior or stuff like that. It is not meant to be used in normal
>> operation.
>>
>> If we only attach it on ->move then yeah maybe a last resort possibility to
>> do it this way, but I think in that case we could rather stick with kernel
>> submissions.
> Well this is a hybrid userspace ring + kernel augmented submit mode, so you
> can keep dma-fences working. Because the dma-fence stuff won't work with
> pure userspace submit, I think that conclusion is rather solid. Once more
> even after this long thread here.

When assisted with unload fences, then yes. The problem is that I can't 
currently see how we could implement those with good performance.

>>> Also, if userspace lies to us and keeps pushing crap into the ring
>>> after it's supposed to be idle: Userspace is already allowed to waste
>>> gpu time. If you're too worried about this set a fairly aggressive
>>> preempt timeout on the unload fence, and kill the context if it takes
>>> longer than what preempting an idle ring should take (because that
>>> would indicate broken/evil userspace).
>> I think you have the wrong expectation here. It is perfectly valid and
>> expected for userspace to keep writing commands into the ring buffer.
>>
>> After all when one frame is completed they want to immediately start
>> rendering the next one.
> Sure, for the true userspace direct submit model. But with that you don't
> get dma-fence, which means this gpu will not work for 3d accel on any
> current linux desktop.

I'm not sure of that. I've looked a bit into how we could add user 
fences to dma_resv objects and that isn't that hard after all.

> Which sucks, hence some hybrid model of using the userspace ring and
> kernel augmented submit is needed. Which was my idea.

Yeah, I think when our firmware folks would really remove the kernel 
queue and we still don't have

>
>> [SNIP]
>> Can't find that offhand either, but see the amdgpu_noretry module option.
>>
>> It basically tells the hardware if retry page faults should be supported or
>> not because this whole TLB shootdown thing when they are supported is
>> extremely costly.
> Hm so synchronous tlb shootdown is a lot more costly when you allow
> retrying of page faults?

Partially correct, yes.

See, when you have retry page faults enabled and unmap something, you 
need to make sure that every unit which could have translated that page 
and has a TLB is either invalidated or waited on until its access has 
completed.

Since every CU could be using the memory location, that takes ages to 
complete compared to the normal invalidation, where you just invalidate 
the L1/L2 and are done.

In addition to that, the recovery adds some extra overhead to every 
memory access, so even without a fault you are quite a bit slower if 
this is enabled.

> That sounds bad, because for full hmm mode you need to be able to retry
> pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and might just
> hang your gpu for a long time while it's waiting for the va->pa lookup
> response to return. So retrying lookups shouldn't be any different really.
>
> And you also need fairly fast synchronous tlb shootdown for hmm. So if
> your hw has a problem with both together that sounds bad.

Completely agree. And since it was my job to validate the implementation 
on Vega10 I was also the first one to realize that.

Felix, a couple of others and I have been trying to work around those 
restrictions ever since.

>> I was more thinking about handling it all in the kernel.
> Yeah can do, just means that you also have to copy the ringbuffer stuff
> over from userspace to the kernel.

That is the least of my worries. The IBs are just addr+length, so no 
more than 16 bytes for each IB.
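As a rough sketch of why the copy is cheap: assuming a hypothetical descriptor layout of a 64-bit GPU address, a 32-bit length and 32 bits of padding (the real packet format is not given here and may differ), each IB entry is exactly 16 bytes. The names IB_DESC, pack_ibs and unpack_ibs are invented for the illustration.

```python
import struct

# Hypothetical layout, little endian: u64 address, u32 length, u32 padding.
IB_DESC = struct.Struct("<QII")


def pack_ibs(ibs):
    """Serialize (addr, length) pairs the way a userspace ring might hold them."""
    return b"".join(IB_DESC.pack(addr, length, 0) for addr, length in ibs)


def unpack_ibs(blob):
    """Read the descriptors back, as a kernel-side copy would need to."""
    return [(addr, length) for addr, length, _pad in IB_DESC.iter_unpack(blob)]
```

With this layout, even a few hundred IBs per submission come to only a few kilobytes to copy.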

> It also means that there are more differences in how your userspace works
> between full userspace mode (necessary for compute) and legacy dma-fence
> mode (necessary for desktop 3d). Which is especially big fun for vulkan,
> since that will have to do both.

That is the bigger problem.

Christian.

>
> But then amd is still hanging onto the amdgpu vs amdkfd split, so you're
> going for max pain in this area anyway :-P
> -Daniel