[Intel-gfx] signal: break out of wait loops on kthread_stop()

Jason A. Donenfeld Jason at zx2c4.com
Wed Oct 19 18:16:22 UTC 2022


On Wed, Oct 19, 2022 at 06:57:38PM +0100, Tvrtko Ursulin wrote:
> 
> On 19/10/2022 17:00, Jason A. Donenfeld wrote:
> > On Wed, Oct 19, 2022 at 7:31 AM Tvrtko Ursulin
> > <tvrtko.ursulin at linux.intel.com> wrote:
> >>
> >>
> >> Hi,
> >>
> >> A question regarding a7c01fa93aeb ("signal: break out of wait loops on
> >> kthread_stop()") if I may.
> >>
> >> We have a bunch code in i915, possibly limited to self tests (ie debug
> >> builds) but still important for our flows, which spawn kernel threads
> >> and exercises parts of the driver.
> >>
> >> Problem we are hitting with this patch is that code did not really need
> >> to be signal aware until now. Well to say that more accurately - we were
> >> able to test the code which is normally executed from userspace, so is
> >> signal aware, but not worry about -ERESTARTSYS or -EINTR within the test
> >> cases itself.
> >>
> >> For example threads which exercise an internal API for a while until the
> >> parent calls kthread_stop. Now those tests can hit unexpected errors.
> >>
> >> Question is how to best approach working around this change. It is of
> >> course technically possible to rework our code in more than one way,
> >> although with some cost and impact already felt due reduced pass rates
> >> in our automated test suites.
> >>
> >> Maybe an opt out kthread flag from this new behavior? Would that be
> >> acceptable as a quick fix? Or any other comments?
> > 
> > You can opt out by running `clear_tsk_thread_flag(current,
> > TIF_NOTIFY_SIGNAL);` at the top of your kthread. But you should really
> > fix your code instead. Were I your reviewer, I wouldn't merge code
> > that took the lazy path like that. However, that should work, if you
> > do opt for the quick fix.
>
> Also, are you confident that the change will not catch anyone else by 
> surprise? In the original thread I did not spot any concerns about the 
> kthreads being generally unprepared to start receiving EINTR/ERESTARTSYS 
> from random call chains.

Pretty sure, yea. i915 is unique in its abuse of the API. Keep in mind
that kthread_stop() also sets KTHREAD_SHOULD_STOP and such. Your code is
abusing the API by calling kthread_run() followed by kthread_stop().

As evidence of how broken your code actually is, the kthread_stop()
function has a comment that makes it clear, "This can also be called
after kthread_create() instead of calling wake_up_process(): the thread
will exit without calling threadfn()," yet i915 has attempted to hack
around it with ridiculous yields and msleeps. For example:

                threads[n] = kthread_run(__igt_breadcrumbs_smoketest,
                                         &t, "igt/%d", n);
...

        yield(); /* start all threads before we begin */
        msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));
...
                err = kthread_stop(threads[n]);


Or here's another one:

                tsk = kthread_run(fn, &thread[i], "igt-%d", i);
...
        msleep(10); /* start all threads before we kthread_stop() */
...
                status = kthread_stop(tsk);

I mean come on.

This is brittle and bad and kind of ridiculous that it shipped this way.
Now you're asking to extend your brittleness, so that you can avoid the
work of cleaning up 5 call sites. Just clean up those 5 call sites. It's
only 5, as far as I can tell.


> Right, but our hand is a bit forced at the moment. Since 6.1-rc1 has 
> propagated to our development tree on Monday, our automated testing 
> started failing significantly, which prevents us merging new work until 
> resolved. So a quick fix trumps the ideal road in the short term. Just 
> because it is quick.

"Short term" -- somehow I can imagine the short term hack will turn into
a long term one.

Anyway, what I suspect you might actually want as a bandaid is a
"kthread_wait()"-like function, that doesn't try to stop the thread with
KTHREAD_SHOULD_STOP and such, but just waits for the completion:

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 30e5bec81d2b..2699cc45ad15 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -86,6 +86,7 @@ void free_kthread_struct(struct task_struct *k);
 void kthread_bind(struct task_struct *k, unsigned int cpu);
 void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask);
 int kthread_stop(struct task_struct *k);
+int kthread_wait(struct task_struct *k);
 bool kthread_should_stop(void);
 bool kthread_should_park(void);
 bool __kthread_should_park(struct task_struct *k);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index f97fd01a2932..d581d78a3a26 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -715,6 +715,22 @@ int kthread_stop(struct task_struct *k)
 }
 EXPORT_SYMBOL(kthread_stop);

+int kthread_wait(struct task_struct *k)
+{
+	struct kthread *kthread;
+	int ret;
+
+	get_task_struct(k);
+	kthread = to_kthread(k);
+	wake_up_process(k);
+	wait_for_completion(&kthread->exited);
+	ret = kthread->result;
+	put_task_struct(k);
+
+	return ret;
+}
+EXPORT_SYMBOL(kthread_stop);
+
 int kthreadd(void *unused)
 {
 	struct task_struct *tsk = current;



More information about the Intel-gfx mailing list