<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p><br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 08/14/2018 11:26 AM, Christian König
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:a71be8e0-5181-1643-8c91-c8d619e60c7e@amd.com">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <div class="moz-cite-prefix">Am 14.08.2018 um 17:17 schrieb Andrey
        Grodzovsky:<br>
      </div>
      <blockquote type="cite"
        cite="mid:2f197b16-4c60-6b6a-0b36-7e60b9e5fc33@amd.com">
        <p>I assume that this is the only code change and no locks are
          taken in drm_sched_entity_push_job - <br>
        </p>
      </blockquote>
      <br>
      What are you talking about? You surely take locks now in
      drm_sched_entity_push_job():<br>
      <blockquote type="cite">+    spin_lock(&entity->rq_lock);<br>
        +    entity->last_user = current->group_leader;<br>
        +    if (list_empty(&entity->list))<br>
      </blockquote>
    </blockquote>
    <br>
    Oh, so your code in drm_sched_entity_flush still relies on my code
    in drm_sched_entity_push_job, OK.<br>
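    <br>
    To make that dependency concrete, here is a minimal single-threaded
    replay of the ordering raised earlier in the thread: process A is
    preempted in flush right after the cmpxchg, and process B pushes a job
    in between. This is only a userspace sketch under assumptions: all
    names are hypothetical stand-ins, last_user is modelled as a PID with
    0 meaning "none", the entity is assumed to be linked into the rq at
    the start, the rq_lock is not modelled because the steps run strictly
    one after another, and the PF_EXITING/SIGKILL check is assumed to be
    true.<br>
    <pre>
```c
/* Userspace model only; every name here is a hypothetical stand-in. */
struct replay_entity {
    int last_user; /* models entity->last_user as a PID, 0 means none */
    int on_rq;     /* 1 while the entity is linked into the run queue */
};

/* flush, step 1: clear last_user if it is still ours, return the old value */
static int replay_flush_cmpxchg(struct replay_entity *e, int self)
{
    int old = e->last_user;

    if (old == self)
        e->last_user = 0;
    return old;
}

/* flush, step 2 (under rq_lock in the kernel): remove the entity, then
 * put it back if somebody else recorded itself as last_user meanwhile */
static void replay_flush_remove(struct replay_entity *e)
{
    e->on_rq = 0;
    if (e->last_user != 0)
        e->on_rq = 1;
}

/* push_job (under rq_lock in the kernel): record the pusher, link if needed */
static void replay_push_job(struct replay_entity *e, int self)
{
    e->last_user = self;
    if (!e->on_rq)
        e->on_rq = 1;
}

/* Replays: A starts flush, B optionally pushes, A finishes flush.
 * Returns whether the entity is still on the run queue afterwards. */
static int replay_interleaving(int b_pushes_in_between)
{
    const int pid_a = 1, pid_b = 2;
    struct replay_entity e = { pid_a, 1 };
    int old = replay_flush_cmpxchg(&e, pid_a);

    if (b_pushes_in_between)
        replay_push_job(&e, pid_b);
    if (old == 0 || old == pid_a)
        replay_flush_remove(&e);
    return e.on_rq;
}
```
</pre>
    In this model the entity stays on the rq when B pushes in between,
    and is removed when nobody else touches it. That outcome depends on
    push_job writing last_user before flush re-checks it, which is why
    both sides would have to hold the same lock.<br>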
    <br>
    <blockquote type="cite"
      cite="mid:a71be8e0-5181-1643-8c91-c8d619e60c7e@amd.com">
      <blockquote type="cite"> </blockquote>
      <br>
      <blockquote type="cite"
        cite="mid:2f197b16-4c60-6b6a-0b36-7e60b9e5fc33@amd.com">
        <p> </p>
        <p>What happens if process A runs drm_sched_entity_push_job
          after this code was executed from the (dying) process B and
          there are still jobs in the queue (the wait_event terminated
          prematurely)? The entity has already been removed from the rq,
          but bool 'first' in drm_sched_entity_push_job will be false,
          so the entity will not be reinserted into the rq entity list
          and no wake-up will be triggered for process A pushing a new
          job.</p>
      </blockquote>
      <br>
      Thought about this as well, but in this case I would say: Shit
      happens!<br>
      <br>
      The dying process did some command submission and because of this
      the entity was killed as well when the process died and that is
      legitimate.<br>
      <br>
      <blockquote type="cite"
        cite="mid:2f197b16-4c60-6b6a-0b36-7e60b9e5fc33@amd.com">
        <p><br>
        </p>
        <p>Another issue below - <br>
        </p>
        <p>Andrey<br>
        </p>
        <br>
        <div class="moz-cite-prefix">On 08/14/2018 03:05 AM, Christian
          König wrote:<br>
        </div>
        <blockquote type="cite"
          cite="mid:0fa473f5-155a-223e-fbb6-37147fd47a17@gmail.com">
          <div class="moz-cite-prefix">I would rather like to avoid
            taking the lock in the hot path.<br>
            <br>
            How about this:<br>
            <br>
                 /* For killed process disable any more IBs enqueue
            right now */<br>
                last_user = cmpxchg(&entity->last_user,
            current->group_leader, NULL);<br>
                 if ((!last_user || last_user ==
            current->group_leader) &&<br>
                     (current->flags & PF_EXITING) &&
            (current->exit_code == SIGKILL)) {<br>
                    grab_lock();<br>
                     drm_sched_rq_remove_entity(entity->rq, entity);<br>
                    if (READ_ONCE(entity->last_user) != NULL)<br>
          </div>
        </blockquote>
        <br>
        This condition is true because process A has just executed
        drm_sched_entity_push_job->WRITE_ONCE(entity->last_user,
        current->group_leader);<br>
        so the line below executes and the entity is reinserted into the
        rq. Let's also say that the entity's job queue is empty now. For
        process A, bool 'first' will be true,<br>
        and hence
        drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq,
        entity) will also take place, causing double insertion of the
        entity into the rq list.<br>
      </blockquote>
      <br>
      Calling drm_sched_rq_add_entity() is harmless, it is protected
      against double insertion.<br>
    </blockquote>
    <br>
    Missed that one, right...<br>
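    <br>
    For reference, a minimal sketch of what such a guard looks like,
    assuming the protection is a list_empty() style check on the
    entity's own list node (as the quoted patch lines suggest). This is a
    userspace model with a tiny circular list; the names are
    hypothetical stand-ins and the rq lock is left out.<br>
    <pre>
```c
/* Userspace model only; names are hypothetical stand-ins for the kernel ones. */
struct guard_node {
    struct guard_node *next, *prev;
};

/* An initialised node points at itself, which also means "not linked". */
static void guard_node_init(struct guard_node *n)
{
    n->next = n;
    n->prev = n;
}

static int guard_node_unlinked(const struct guard_node *n)
{
    return n->next == n;
}

static void guard_node_add_tail(struct guard_node *n, struct guard_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static int guard_list_count(const struct guard_node *head)
{
    int count = 0;
    const struct guard_node *pos;

    for (pos = head->next; pos != head; pos = pos->next)
        count++;
    return count;
}

struct guard_rq { struct guard_node entities; };
struct guard_entity { struct guard_node list; };

/* Adding is a no-op when the entity is already linked somewhere. */
static void guard_rq_add_entity(struct guard_rq *rq, struct guard_entity *e)
{
    if (!guard_node_unlinked(&e->list))
        return;
    guard_node_add_tail(&e->list, &rq->entities);
}

/* Adds the same entity twice and reports how many entries the rq holds. */
static int guard_add_twice(void)
{
    struct guard_rq rq;
    struct guard_entity e;

    guard_node_init(&rq.entities);
    guard_node_init(&e.list);
    guard_rq_add_entity(&rq, &e);
    guard_rq_add_entity(&rq, &e);
    return guard_list_count(&rq.entities);
}
```
</pre>
    With the early return in place the second add leaves the list
    untouched, so the entity appears on the rq only once.<br>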
    <blockquote type="cite"
      cite="mid:a71be8e0-5181-1643-8c91-c8d619e60c7e@amd.com"> <br>
      But thinking more about it, your idea of adding a killed or
      finished flag becomes more and more appealing as a way to get
      consistent handling here.<br>
      <br>
      Christian.<br>
    </blockquote>
    <br>
    So to be clear - you would like something like this:<br>
    <br>
    Remove entity->last_user and add a 'stopped' flag to
    drm_sched_entity, to be set in drm_sched_entity_flush. Then in
    drm_sched_entity_push_job, check 'if (entity->stopped)' and, when
    true, just return some error back to the user instead of pushing the
    job?<br>
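    <br>
    Roughly, as a userspace sketch and not the actual patch: the struct,
    the function names and the error value below are all assumptions
    made for illustration, and the job queue is reduced to a counter. In
    the kernel the flag would presumably have to be set and checked
    under the same lock, and drm_sched_entity_push_job currently returns
    void, so its signature would need to change.<br>
    <pre>
```c
/* Userspace model only; the error value and all names are assumptions. */
enum { MODEL_ERR_STOPPED = -1 };

struct stop_entity {
    int stopped;     /* set once by flush, never cleared */
    int queued_jobs; /* stands in for the entity's job queue */
};

/* flush marks the entity as stopped instead of tracking last_user */
static void stop_entity_flush(struct stop_entity *e)
{
    e->stopped = 1;
}

/* push_job refuses new work once the entity has been stopped */
static int stop_entity_push_job(struct stop_entity *e)
{
    if (e->stopped)
        return MODEL_ERR_STOPPED;
    e->queued_jobs++;
    return 0;
}
```
</pre>
    A push before the flush succeeds, while any push after it gets the
    error back and leaves the queue unchanged.<br>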
    <br>
    Andrey<br>
    <br>
    <blockquote type="cite"
      cite="mid:a71be8e0-5181-1643-8c91-c8d619e60c7e@amd.com"> <br>
      <blockquote type="cite"
        cite="mid:2f197b16-4c60-6b6a-0b36-7e60b9e5fc33@amd.com"> <br>
        Andrey<br>
        <br>
        <blockquote type="cite"
          cite="mid:0fa473f5-155a-223e-fbb6-37147fd47a17@gmail.com">
          <div class="moz-cite-prefix">            
            drm_sched_rq_add_entity(entity->rq, entity);<br>
                    drop_lock();<br>
                }<br>
             <br>
            Christian.<br>
            <br>
            Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:<br>
          </div>
          <blockquote type="cite"
            cite="mid:82109a00-aebf-1e5f-5346-eef541a361df@amd.com">
            <p>Attached. </p>
            <p>If the general idea in the patch is OK, I can think of a
              test (and maybe add it to the libdrm amdgpu tests) to
              actually simulate this scenario with 2 forked concurrent
              processes working on the same entity's job queue, where
              one is dying while the other keeps pushing to the same
              queue. For now I have only tested it with a normal boot
              and by running multiple glxgears instances concurrently,
              which doesn't really test this code path since I think
              each of them works on its own FD.<br>
            </p>
            <p>Andrey<br>
            </p>
            <br>
            <div class="moz-cite-prefix">On 08/10/2018 09:27 AM,
              Christian König wrote:<br>
            </div>
            <blockquote type="cite"
              cite="mid:5bf40a54-18f9-98fd-a3df-dd0b8da0a424@gmail.com">
              <div class="moz-cite-prefix">Crap, yeah indeed that needs
                to be protected by some lock.<br>
                <br>
                Going to prepare a patch for that,<br>
                Christian.<br>
                <br>
                Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:<br>
              </div>
              <blockquote type="cite"
                cite="mid:54621fc1-7246-f1bf-26bb-a16c4daf249f@amd.com">
                <p>Reviewed-by: Andrey Grodzovsky <a
                    class="moz-txt-link-rfc2396E"
                    href="mailto:andrey.grodzovsky@amd.com"
                    moz-do-not-send="true"><andrey.grodzovsky@amd.com></a></p>
                <p><br>
                </p>
                <p>But I still  have questions about
                  entity->last_user (didn't notice this before) - <br>
                </p>
                <p>It looks to me like there is a race condition with
                  its current usage. Let's say process A was preempted
                  after doing drm_sched_entity_flush->cmpxchg(...).
                  Now process B, working on the same entity (forked), is
                  inside drm_sched_entity_push_job; it writes its PID to
                  entity->last_user and also executes
                  drm_sched_rq_add_entity. Now process A runs again and
                  executes drm_sched_rq_remove_entity, inadvertently
                  removing the entity that process B just added from its
                  scheduler rq.</p>
                <p>It looks to me like we should instead take one lock
                  around both the entity->last_user accesses and the
                  additions/removals of the entity to/from the rq.</p>
                <p>Andrey<br>
                </p>
                <br>
                <div class="moz-cite-prefix">On 08/06/2018 10:18 AM,
                  Nayan Deshmukh wrote:<br>
                </div>
                <blockquote type="cite"
cite="mid:CAFd4ddzyvHPHepAgs=mjyWVj0WDV_pQbE9x7aHwNZ_zcME6fqQ@mail.gmail.com">
                  <div dir="ltr">
                    <div>
                      <div>I forgot about this since we started
                        discussing possible scenarios of processes and
                        threads.<br>
                        <br>
                      </div>
                      In any case, this check is redundant. Acked-by:
                      Nayan Deshmukh <<a
                        href="mailto:nayan26deshmukh@gmail.com"
                        moz-do-not-send="true">nayan26deshmukh@gmail.com</a>><br>
                      <br>
                    </div>
                    Nayan<br>
                  </div>
                  <br>
                  <div class="gmail_quote">
                    <div dir="ltr">On Mon, Aug 6, 2018 at 7:43 PM
                      Christian König <<a
                        href="mailto:ckoenig.leichtzumerken@gmail.com"
                        moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>>
                      wrote:<br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0 0 0
                      .8ex;border-left:1px #ccc solid;padding-left:1ex">Ping.
                      Any objections to that?<br>
                      <br>
                      Christian.<br>
                      <br>
                      Am 03.08.2018 um 13:08 schrieb Christian König:<br>
                      > That is superfluous now.<br>
                      ><br>
                      > Signed-off-by: Christian König <<a
                        href="mailto:christian.koenig@amd.com"
                        target="_blank" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                      > ---<br>
                      >   drivers/gpu/drm/scheduler/gpu_scheduler.c |
                      5 -----<br>
                      >   1 file changed, 5 deletions(-)<br>
                      ><br>
                      > diff --git
                      a/drivers/gpu/drm/scheduler/gpu_scheduler.c
                      b/drivers/gpu/drm/scheduler/gpu_scheduler.c<br>
                      > index 85908c7f913e..65078dd3c82c 100644<br>
                      > ---
                      a/drivers/gpu/drm/scheduler/gpu_scheduler.c<br>
                      > +++
                      b/drivers/gpu/drm/scheduler/gpu_scheduler.c<br>
                      > @@ -590,11 +590,6 @@ void
                      drm_sched_entity_push_job(struct drm_sched_job
                      *sched_job,<br>
                      >       if (first) {<br>
                      >               /* Add the entity to the run
                      queue */<br>
                      >             
                       spin_lock(&entity->rq_lock);<br>
                      > -             if (!entity->rq) {<br>
                      > -                     DRM_ERROR("Trying to
                      push to a killed entity\n");<br>
                      > -                   
                       spin_unlock(&entity->rq_lock);<br>
                      > -                     return;<br>
                      > -             }<br>
                      >             
                       drm_sched_rq_add_entity(entity->rq, entity);<br>
                      >             
                       spin_unlock(&entity->rq_lock);<br>
                      >             
                       drm_sched_wakeup(entity->rq->sched);<br>
                      <br>
                    </blockquote>
                  </div>
                </blockquote>
                <br>
              </blockquote>
              <br>
            </blockquote>
            <br>
            <br>
          </blockquote>
          <br>
        </blockquote>
        <br>
      </blockquote>
      <br>
    </blockquote>
    <br>
  </body>
</html>