[Intel-gfx] [RFC PATCH 00/97] Basic GuC submission support in the i915

Thu Jun 3 10:52:05 UTC 2021

On 03/06/2021 04:41, Matthew Brost wrote:
> On Wed, Jun 02, 2021 at 08:57:02PM +0200, Daniel Vetter wrote:
>> On Wed, Jun 2, 2021 at 5:27 PM Tvrtko Ursulin
>> <tvrtko.ursulin at linux.intel.com> wrote:
>>> On 25/05/2021 17:45, Matthew Brost wrote:
>>>> On Tue, May 25, 2021 at 11:32:26AM +0100, Tvrtko Ursulin wrote:
>>>>>    * Context pinning code with it's magical two adds, subtract and cmpxchg is
>>>>> dodgy as well.
>>>>
>>>> Daniele tried to remove this and it proved quite difficult + created
>>>> even more races in the backend code. This was prior to the pre-pin and
>>>> post-unpin code which makes this even more difficult to fix as I believe
>>>> these functions would need to be removed first. Not saying we can't
>>>> revisit this someday but I personally really like it - it is a clever
>>>> way to avoid reentering the pin / unpin code while asynchronous things
>>>> are happening rather than some complex locking scheme. Lastly, this code
>>>> has proved incredibly stable as I don't think we've had to fix a single
>>>> thing in this area since we've been using this code internally.
>>>
>>> Pretty much same as above. The code like:
>>>
>>> static inline void __intel_context_unpin(struct intel_context *ce)
>>> {
>>>          if (!ce->ops->sched_disable) {
>>>                  __intel_context_do_unpin(ce, 1);
>>>          } else {
>>>                  while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
>>>                          if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
>>>                                  ce->ops->sched_disable(ce);
>>>                                  break;
>>>                          }
>>>                  }
>>>          }
>>> }
>>>
>>> That's pretty much impenetrable for me and the only thing I can think of
>>> here is **ALARM** must be broken! See what others think..
> 
> Yea, probably should add a comment:
> 
> /*
>   * If the context has the sched_disable function, it isn't safe to unpin
>   * until this function completes. This function is allowed to complete
>   * asynchronously too. To avoid this function from being entered twice
>   * and move ownership of the unpin to this function's completion, adjust
>   * the pin count to 2 before it is entered. When this function completes
>   * the context can call intel_context_sched_unpin which decrements the
>   * pin count by 2 potentially resulting in an unpin.
>   *
>   * A while loop is needed to ensure the atomicity of the pin count. e.g.
>   * The below if / else statement has a race:
>   *
>   * if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1)
>   * 	ce->ops->sched_disable(ce);
>   * else
>   * 	atomic_dec(ce, 1);
>   *
>   * Two threads could simultaneously fail the if clause resulting in the
>   * pin_count going to 0 with scheduling enabled + the context pinned.
>   */

I have many questions here..

How time bound is the busy loop?

In guc_context_sched_disable the case where someone pins after the magic 
2 has been set is handled.

But what is pin_count got to 2 legitimately, via the unpin and pin 
between the atomic_cmpxchg in __intel_context_unpin and relevant lines 
in guc_context_sched_disable get to execute?

Why is the pin_count dec in guc_context_sched_disable under the 
ce->guc_state.lock?

What is the point of:

	enabled = context_enabled(ce);
	if (unlikely(!enabled || submission_disabled(guc))) {
		if (!enabled)
			clr_context_enabled(ce);

Reads like clearing the enabled bit if it is not set?!

Why is:

static inline void clr_context_enabled(struct intel_context *ce)
{
	atomic_and((u32)~SCHED_STATE_NO_LOCK_ENABLED,
		   &ce->guc_sched_state_no_lock);
}

Operating on a field called "guc_sched_state_no_lock" (no lock!) while 
the caller is holding guc_state.lock while manipulating that lock.

Regards,

Tvrtko