<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">On 14.08.2018 at 17:17, Andrey
      Grodzovsky wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:2f197b16-4c60-6b6a-0b36-7e60b9e5fc33@amd.com">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <p>I assume that this is the only code change and no locks are
        taken in drm_sched_entity_push_job - <br>
      </p>
    </blockquote>
    <br>
    What are you talking about? You surely now take locks in
    drm_sched_entity_push_job():<br>
    <blockquote type="cite">+    spin_lock(&entity->rq_lock);<br>
      +    entity->last_user = current->group_leader;<br>
      +    if (list_empty(&entity->list))<br>
    </blockquote>
    <br>
    <blockquote type="cite"
      cite="mid:2f197b16-4c60-6b6a-0b36-7e60b9e5fc33@amd.com">
      <p> </p>
      <p>What happens if process A runs drm_sched_entity_push_job after
        this code was executed from the (dying) process B and there</p>
      <p>are still jobs in the queue (the wait_event terminated
        prematurely)? The entity has already been removed from the rq, but
        the bool 'first' in drm_sched_entity_push_job</p>
      <p>will be false, so the entity will not be reinserted into the rq
        entity list and no wake-up will be triggered for process A
        pushing a new job.</p>
    </blockquote>
    <br>
    Thought about this as well, but in this case I would say: Shit
    happens!<br>
    <br>
    The dying process did some command submission, so the entity was
    killed along with it when the process died, and that is
    legitimate.<br>
    <br>
    <blockquote type="cite"
      cite="mid:2f197b16-4c60-6b6a-0b36-7e60b9e5fc33@amd.com">
      <p><br>
      </p>
      <p>Another issue below - <br>
      </p>
      <p>Andrey<br>
      </p>
      <br>
      <div class="moz-cite-prefix">On 08/14/2018 03:05 AM, Christian
        König wrote:<br>
      </div>
      <blockquote type="cite"
        cite="mid:0fa473f5-155a-223e-fbb6-37147fd47a17@gmail.com">
        <div class="moz-cite-prefix">I would rather like to avoid taking
          the lock in the hot path.<br>
          <br>
          How about this:<br>
          <br>
               /* For killed process disable any more IBs enqueue right
          now */<br>
              last_user = cmpxchg(&entity->last_user,
          current->group_leader, NULL);<br>
               if ((!last_user || last_user == current->group_leader)
          &&<br>
                   (current->flags & PF_EXITING) &&
          (current->exit_code == SIGKILL)) {<br>
                  grab_lock();<br>
                   drm_sched_rq_remove_entity(entity->rq, entity);<br>
                  if (READ_ONCE(entity->last_user) != NULL)<br>
        </div>
      </blockquote>
      <br>
      This condition is true because just now process A did
      drm_sched_entity_push_job->WRITE_ONCE(entity->last_user,
      current->group_leader);<br>
      so the line below is executed and the entity is reinserted into
      the rq. Let's also say that the entity job queue is empty now. For
      process A the bool 'first' will be true,<br>
      and hence
      drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq,
      entity) will also take place, causing double insertion of the
      entity into the rq list.<br>
    </blockquote>
    <br>
    Calling drm_sched_rq_add_entity() is harmless; it is protected
    against double insertion.<br>
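    <br>
    As a rough illustration of that kind of protection (a userspace
    sketch, not the scheduler's actual code): the add below is made
    idempotent by checking whether the entity's list node is already
    linked. The names list_node, sketch_rq and sketch_entity are
    hypothetical stand-ins, and all locking is left out to keep the
    sketch single-threaded.<br>
    <pre wrap="">
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace stand-in for a circular doubly linked list node. */
struct list_node {
    struct list_node *prev, *next;
};

static void list_node_init(struct list_node *n)
{
    n->prev = n;
    n->next = n;
}

/* A node that points at itself is not on any list. */
static bool list_node_unlinked(const struct list_node *n)
{
    return n->next == n;
}

static void list_node_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void list_node_del_init(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_node_init(n);
}

/* Hypothetical run queue and entity, reduced to their list linkage. */
struct sketch_rq { struct list_node entities; };
struct sketch_entity { struct list_node link; };

/* Returns true if the entity was inserted, false if it was already queued. */
static bool sketch_rq_add_entity(struct sketch_rq *rq, struct sketch_entity *e)
{
    assert(rq != NULL && e != NULL);
    if (!list_node_unlinked(&e->link))
        return false; /* already on a list: a second add is a no-op */
    list_node_add_tail(&e->link, &rq->entities);
    return true;
}

/* Removing an entity that is not queued is a no-op as well. */
static void sketch_rq_remove_entity(struct sketch_entity *e)
{
    assert(e != NULL);
    if (!list_node_unlinked(&e->link))
        list_node_del_init(&e->link);
}

/* Counts the entities currently on the run queue. */
static size_t sketch_rq_count(const struct sketch_rq *rq)
{
    size_t n = 0;
    const struct list_node *p;

    for (p = rq->entities.next; p != &rq->entities; p = p->next)
        n++;
    return n;
}
```
    </pre>
    With this guard, two callers racing to add the same entity (and
    serialized by whatever lock the real code uses) leave exactly one
    entry on the list; the second sketch_rq_add_entity() call simply
    returns false.<br>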
    <br>
    But thinking more about it, your idea of adding a killed or finished
    flag becomes more and more appealing as a way to get consistent
    handling here.<br>
    <br>
    Christian.<br>
    <br>
    <blockquote type="cite"
      cite="mid:2f197b16-4c60-6b6a-0b36-7e60b9e5fc33@amd.com"> <br>
      Andrey<br>
      <br>
      <blockquote type="cite"
        cite="mid:0fa473f5-155a-223e-fbb6-37147fd47a17@gmail.com">
        <div class="moz-cite-prefix">            
          drm_sched_rq_add_entity(entity->rq, entity);<br>
                  drop_lock();<br>
              }<br>
           <br>
          Christian.<br>
          <br>
          On 13.08.2018 at 18:43, Andrey Grodzovsky wrote:<br>
        </div>
        <blockquote type="cite"
          cite="mid:82109a00-aebf-1e5f-5346-eef541a361df@amd.com">
          <p>Attached. </p>
          <p>If the general idea in the patch is OK, I can think of a
            test (and maybe add it to the libdrm amdgpu tests) to actually
            simulate this scenario with 2 forked</p>
          <p>concurrent processes working on the same entity's job queue,
            where one is dying while the other keeps pushing to the same
            queue. For now I have only tested it</p>
          <p>with a normal boot and running multiple glxgears concurrently,
            which doesn't really test this code path since I think
            each of them works on its own FD.<br>
          </p>
          <p>Andrey<br>
          </p>
          <br>
          <div class="moz-cite-prefix">On 08/10/2018 09:27 AM, Christian
            König wrote:<br>
          </div>
          <blockquote type="cite"
            cite="mid:5bf40a54-18f9-98fd-a3df-dd0b8da0a424@gmail.com">
            <div class="moz-cite-prefix">Crap, yeah indeed that needs to
              be protected by some lock.<br>
              <br>
              Going to prepare a patch for that,<br>
              Christian.<br>
              <br>
              On 09.08.2018 at 21:49, Andrey Grodzovsky wrote:<br>
            </div>
            <blockquote type="cite"
              cite="mid:54621fc1-7246-f1bf-26bb-a16c4daf249f@amd.com">
              <p>Reviewed-by: Andrey Grodzovsky <a
                  class="moz-txt-link-rfc2396E"
                  href="mailto:andrey.grodzovsky@amd.com"
                  moz-do-not-send="true"><andrey.grodzovsky@amd.com></a></p>
              <p><br>
              </p>
              <p>But I still  have questions about entity->last_user
                (didn't notice this before) - <br>
              </p>
              <p>Looks to me like there is a race condition with its current
                usage. Let's say process A was preempted after doing
                drm_sched_entity_flush->cmpxchg(...);</p>
              <p>now process B, working on the same entity (forked), is inside
                drm_sched_entity_push_job. It writes its PID to
                entity->last_user and also</p>
              <p>executes drm_sched_rq_add_entity. Now process A runs
                again and executes drm_sched_rq_remove_entity,
                inadvertently causing process B's removal</p>
              <p>from its scheduler rq.</p>
              <p>Looks to me like we should instead take one lock around both
                the entity->last_user accesses and the adds/removals of the
                entity to/from the rq.</p>
              <p>Andrey<br>
              </p>
              <br>
              <div class="moz-cite-prefix">On 08/06/2018 10:18 AM, Nayan
                Deshmukh wrote:<br>
              </div>
              <blockquote type="cite"
cite="mid:CAFd4ddzyvHPHepAgs=mjyWVj0WDV_pQbE9x7aHwNZ_zcME6fqQ@mail.gmail.com">
                <div dir="ltr">
                  <div>
                    <div>I forgot about this since we started discussing
                      possible scenarios of processes and threads.<br>
                      <br>
                    </div>
                    In any case, this check is redundant. Acked-by:
                    Nayan Deshmukh <<a
                      href="mailto:nayan26deshmukh@gmail.com"
                      moz-do-not-send="true">nayan26deshmukh@gmail.com</a>><br>
                    <br>
                  </div>
                  Nayan<br>
                </div>
                <br>
                <div class="gmail_quote">
                  <div dir="ltr">On Mon, Aug 6, 2018 at 7:43 PM
                    Christian König <<a
                      href="mailto:ckoenig.leichtzumerken@gmail.com"
                      moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>>
                    wrote:<br>
                  </div>
                  <blockquote class="gmail_quote" style="margin:0 0 0
                    .8ex;border-left:1px #ccc solid;padding-left:1ex">Ping.
                    Any objections to that?<br>
                    <br>
                    Christian.<br>
                    <br>
                    On 03.08.2018 at 13:08, Christian König wrote:<br>
                    > That is superfluous now.<br>
                    ><br>
                    > Signed-off-by: Christian König <<a
                      href="mailto:christian.koenig@amd.com"
                      target="_blank" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                    > ---<br>
                    >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5
                    -----<br>
                    >   1 file changed, 5 deletions(-)<br>
                    ><br>
                    > diff --git
                    a/drivers/gpu/drm/scheduler/gpu_scheduler.c
                    b/drivers/gpu/drm/scheduler/gpu_scheduler.c<br>
                    > index 85908c7f913e..65078dd3c82c 100644<br>
                    > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c<br>
                    > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c<br>
                    > @@ -590,11 +590,6 @@ void
                    drm_sched_entity_push_job(struct drm_sched_job
                    *sched_job,<br>
                    >       if (first) {<br>
                    >               /* Add the entity to the run
                    queue */<br>
                    >             
                     spin_lock(&entity->rq_lock);<br>
                    > -             if (!entity->rq) {<br>
                    > -                     DRM_ERROR("Trying to push
                    to a killed entity\n");<br>
                    > -                   
                     spin_unlock(&entity->rq_lock);<br>
                    > -                     return;<br>
                    > -             }<br>
                    >             
                     drm_sched_rq_add_entity(entity->rq, entity);<br>
                    >             
                     spin_unlock(&entity->rq_lock);<br>
                    >             
                     drm_sched_wakeup(entity->rq->sched);<br>
                    <br>
                  </blockquote>
                </div>
              </blockquote>
              <br>
            </blockquote>
            <br>
          </blockquote>
          <br>
          <br>
          <fieldset class="mimeAttachmentHeader"></fieldset>
          <br>
          <pre wrap="">_______________________________________________
dri-devel mailing list
<a class="moz-txt-link-abbreviated" href="mailto:dri-devel@lists.freedesktop.org" moz-do-not-send="true">dri-devel@lists.freedesktop.org</a>
<a class="moz-txt-link-freetext" href="https://lists.freedesktop.org/mailman/listinfo/dri-devel" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/dri-devel</a>
</pre>
        </blockquote>
        <br>
      </blockquote>
      <br>
    </blockquote>
    <br>
  </body>
</html>