<div dir="auto">Do you have any work actually going into multiple pipes? My understanding is that OpenCL will only use one queue at a time (but I'm not really certain about that).<div dir="auto"> </div><div dir="auto">You can also check whether the app works correctly when it is executed on pipe 0, and whether it hangs on pipes 1 and above. I removed all the places where pipe 0 was hardcoded in the open driver, but it may still be hardcoded somewhere in the closed stack.</div><div dir="auto"> </div><div dir="auto">Regards,</div><div dir="auto">Andres </div></div><div class="gmail_extra"> <div class="gmail_quote">On Nov 6, 2017 10:19 PM, "Zhou, David(ChunMing)" <<a href="mailto:David1.Zhou@amd.com">David1.Zhou@amd.com</a>> wrote: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <div dir="ltr"> <div id="m_-7809043402888492486divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif" dir="ltr"> Then synchronization should not be the problem; it may be related to a multipipe HW setting issue.<br><br>Regards,<br>David Zhou </div> <hr style="display:inline-block;width:98%"> <div id="m_-7809043402888492486divRplyFwdMsg" dir="ltr">From: Andres Rodriguez <<a href="mailto:andresx7@gmail.com" target="_blank">andresx7@gmail.com</a>><br>Sent: Tuesday, November 7, 2017 2:00:57 AM<br>To: Zhou, David(ChunMing); amd-gfx list<br>Cc: Deucher, Alexander<br>Subject: Re: [PATCH 1/2] drm/amdgpu: use multipipe compute policy on non PL11 asics <div> </div> </div> <div class="m_-7809043402888492486BodyFragment"> <div class="m_-7809043402888492486PlainText">Sorry, my mail client seems to have blown up and my reply got cut off. Here is the full version:<br><br>On 2017-11-06 01:49 AM, Chunming Zhou wrote:<br>> Hi Andres,<br>><br><br>Hi David,<br><br>> With your patch, OCLperf hung.<br><br>Is this on all ASICs or just a specific one?<br><br>><br>> Could you explain more?<br>
><br>> If I understand correctly, the difference with and without this patch is whether queue_bitmap gets the first two queues or all queues of pipe 0.<br><br>It is slightly different. With this patch we also use the first two queues of every pipe, not just pipe 0:<pre>Pre-patch:
|-Pipe 0-||-Pipe 1-||-Pipe 2-||-Pipe 3-|
 11111111  00000000  00000000  00000000

Post-patch:
|-Pipe 0-||-Pipe 1-||-Pipe 2-||-Pipe 3-|
 11000000  11000000  11000000  11000000</pre>This enables real parallelism for compute: jobs on different pipes can execute in parallel, while jobs on the same pipe (but different queues) share the hardware through timeslicing.<br><br>><br>> Then the UMD can submit compute commands to different queue numbers, as selected by amdgpu_queue_mgr_map.<br>><br>> I checked the amdgpu_queue_mgr_map implementation; CS_IOCTL can map a user ring to a different HW ring depending on whether it is busy or idle, right?<br><br>Yes. When a queue is first used, amdgpu_queue_mgr_map decides which kernel ring id the usermode ring maps to.<br><br>> If yes, I see a bug in it that will prevent our sched_fence from working. Our sched fence assumes jobs are executed in order, and your queue mapping breaks this.<br><br>I think you mean that work will execute out of order because it goes to different rings? That should not happen, since the id mapping is permanent on a per-context basis. Once a mapping is decided, it is cached for that context so that we keep execution-order guarantees; see the id-caching code in amdgpu_queue_mgr.c for reference. As long as usermode keeps submitting work to the same ring, it will all execute in order (all on the same ring). This guarantee is unchanged from before the patch. Note also that amdgpu_queue_mgr_map was already using an LRU policy long before this patch.<br><br>
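To illustrate why a per-context cache preserves ordering, here is a minimal userspace sketch in C. It is not the amdgpu_queue_mgr.c code: the ring counts, the structs, the function names (ctx_ring_map_init, pick_lru_ring, map_user_ring), and the rule that every submission refreshes a ring's LRU timestamp are all hypothetical. It only models the two properties described above: the LRU choice is made on a context's first use of a user ring, and every later submission on that user ring reuses the cached kernel ring.

```c
#include <assert.h>

#define NUM_HW_RINGS   8  /* hypothetical number of kernel compute rings */
#define NUM_USER_RINGS 4  /* hypothetical number of user ring ids per context */

/* Hypothetical LRU bookkeeping shared by all contexts. */
struct hw_ring_lru {
	unsigned long last_used[NUM_HW_RINGS];
	unsigned long clock;
};

/* Hypothetical per-context cache: user ring id -> kernel ring id (-1 = unmapped). */
struct ctx_ring_map {
	int hw_ring[NUM_USER_RINGS];
};

static void ctx_ring_map_init(struct ctx_ring_map *map)
{
	for (int u = 0; u < NUM_USER_RINGS; ++u)
		map->hw_ring[u] = -1;
}

/* Return the kernel ring with the oldest timestamp (lowest index on ties). */
static int pick_lru_ring(const struct hw_ring_lru *lru)
{
	int best = 0;

	for (int r = 1; r < NUM_HW_RINGS; ++r)
		if (lru->last_used[r] < lru->last_used[best])
			best = r;
	return best;
}

/* Map a context's user ring to a kernel ring. The LRU choice is made only
 * on first use; afterwards the cached ring is returned, so all of this
 * context's jobs on that user ring go to one kernel ring, in order. */
static int map_user_ring(struct hw_ring_lru *lru, struct ctx_ring_map *map,
			 int user_ring)
{
	assert(user_ring >= 0 && user_ring < NUM_USER_RINGS);

	int r = map->hw_ring[user_ring];
	if (r < 0) {
		r = pick_lru_ring(lru);
		map->hw_ring[user_ring] = r;
	}
	lru->last_used[r] = ++lru->clock;  /* assumed: every submission refreshes the ring */
	return r;
}
```

Starting from a zeroed hw_ring_lru, the first context's user ring 0 is assigned kernel ring 0 and a second context's user ring 0 gets ring 1; repeated submissions from either context keep returning the ring chosen on first use, which is what keeps their jobs in order.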
Regards,<br>Andres<br><br>On Mon, Nov 6, 2017 at 12:44 PM, Andres Rodriguez <<a href="mailto:andresx7@gmail.com" target="_blank">andresx7@gmail.com</a>> wrote:<br>> On 2017-11-06 01:49 AM, Chunming Zhou wrote:<br>>> Hi Andres,<br>><br>> Hi David,<br>><br>>> With your patch, OCLperf hung.<br>><br>> Is this on all ASICs or just a specific one?<br>><br>>> Could you explain more?<br>>><br>>> If I understand correctly, the difference with and without this patch is whether queue_bitmap gets the first two queues or all queues of pipe 0.<br>><br>> It is slightly different. With this patch we also use the first two queues of every pipe, not just pipe 0:<pre>> Pre-patch:
> |-Pipe 0-||-Pipe 1-||-Pipe 2-||-Pipe 3-|
>  11111111  00000000  00000000  00000000
>
> Post-patch:
> |-Pipe 0-||-Pipe 1-||-Pipe 2-||-Pipe 3-|
>  11000000  11000000  11000000  11000000</pre>> This enables real parallelism for compute: jobs on different pipes can execute in parallel, while jobs on the same pipe (but different queues) share the hardware through timeslicing.<br>><br>>> Then the UMD can submit compute commands to different queue numbers, as selected by amdgpu_queue_mgr_map.<br>>><br>>> I checked the amdgpu_queue_mgr_map implementation; CS_IOCTL can map a user ring to a different HW ring depending on whether it is busy or idle, right?<br>>><br>>> If yes, I see a bug in it that will prevent our sched_fence from working. Our sched fence assumes jobs are executed in order, and your queue mapping breaks this.<br>>><br>>> Regards,<br>>> David Zhou<br>>><br>>> On 2017-09-27 00:22, Andres Rodriguez wrote:<br>>>> A performance regression in OpenCL tests on Polaris11 caused this feature to be disabled for all ASICs.<br>>>><br>>>> Instead, disable it selectively on the affected ASICs.
<pre>>>>
>>> Signed-off-by: Andres Rodriguez <<a href="mailto:andresx7@gmail.com" target="_blank">andresx7@gmail.com</a>>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 14 ++++++++++++--
>>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> index 4f6c68f..3d76e76 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> @@ -109,9 +109,20 @@ void amdgpu_gfx_parse_disable_cu(unsigned *mask, unsigned max_se, unsigned max_s
>>>  	}
>>>  }
>>>
>>> +static bool amdgpu_gfx_is_multipipe_capable(struct amdgpu_device *adev)
>>> +{
>>> +	/* FIXME: spreading the queues across pipes causes perf regressions
>>> +	 * on POLARIS11 compute workloads */
>>> +	if (adev->asic_type == CHIP_POLARIS11)
>>> +		return false;
>>> +
>>> +	return adev->gfx.mec.num_mec > 1;
>>> +}
>>> +
>>>  void amdgpu_gfx_compute_queue_acquire(struct amdgpu_device *adev)
>>>  {
>>>  	int i, queue, pipe, mec;
>>> +	bool multipipe_policy = amdgpu_gfx_is_multipipe_capable(adev);
>>>
>>>  	/* policy for amdgpu compute queue ownership */
>>>  	for (i = 0; i < AMDGPU_MAX_COMPUTE_QUEUES; ++i) {
>>> @@ -125,8 +136,7 @@ void amdgpu_gfx_compute_queue_acquire(struct amdgpu_device *adev)
>>>  		if (mec >= adev->gfx.mec.num_mec)
>>>  			break;
>>>
>>> -		/* FIXME: spreading the queues across pipes causes perf regressions */
>>> -		if (0) {
>>> +		if (multipipe_policy) {
>>>  			/* policy: amdgpu owns the first two queues of the first MEC */
>>>  			if (mec == 0 && queue < 2)
>>>  				set_bit(i, adev->gfx.mec.queue_bitmap);
>>
>>
></pre></div> </div> </div> </blockquote></div></div>
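The queue-ownership policy in the patch can be modeled in userspace to reproduce the Pre-patch/Post-patch diagrams from the thread. This is a sketch under stated assumptions, not driver code: the layout of 2 MECs, 4 pipes per MEC and 8 queues per pipe, the flattening of the index i into (mec, pipe, queue) with the queue varying fastest, and the names compute_queue_bitmap() and format_mec0() are all hypothetical. The kernel code works from fields of adev->gfx.mec such as num_mec and records ownership with set_bit() on queue_bitmap; this model also takes the policy as a plain flag instead of calling amdgpu_gfx_is_multipipe_capable(), which additionally returns false on Polaris11 and whenever num_mec is not greater than 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout matching the thread's diagram. */
#define NUM_MEC         2
#define PIPES_PER_MEC   4
#define QUEUES_PER_PIPE 8
#define MAX_QUEUES      (NUM_MEC * PIPES_PER_MEC * QUEUES_PER_PIPE)  /* 64 */
#define MEC0_STR_LEN    (PIPES_PER_MEC * (QUEUES_PER_PIPE + 1))       /* 36, incl. '\0' */

/* Bit i set => the kernel driver claims flat queue index i. */
static uint64_t compute_queue_bitmap(bool multipipe_policy)
{
	uint64_t bitmap = 0;

	for (int i = 0; i < MAX_QUEUES; ++i) {
		/* Assumed flattening: queue varies fastest, then pipe, then MEC. */
		int queue = i % QUEUES_PER_PIPE;
		int pipe = (i / QUEUES_PER_PIPE) % PIPES_PER_MEC;
		int mec = (i / QUEUES_PER_PIPE) / PIPES_PER_MEC;
		bool owned;

		if (multipipe_policy)
			owned = (mec == 0 && queue < 2);  /* first two queues of every pipe */
		else
			owned = (mec == 0 && pipe == 0);  /* every queue of pipe 0 */

		if (owned)
			bitmap |= UINT64_C(1) << i;
	}
	return bitmap;
}

/* Render MEC 0 like the thread's diagram: queue 0 leftmost, one
 * 8-character group per pipe, groups separated by a space. */
static void format_mec0(uint64_t bitmap, char out[MEC0_STR_LEN])
{
	int pos = 0;

	for (int pipe = 0; pipe < PIPES_PER_MEC; ++pipe) {
		if (pipe > 0)
			out[pos++] = ' ';
		for (int q = 0; q < QUEUES_PER_PIPE; ++q) {
			int i = pipe * QUEUES_PER_PIPE + q;
			out[pos++] = ((bitmap >> i) & 1) ? '1' : '0';
		}
	}
	assert(pos == MEC0_STR_LEN - 1);
	out[pos] = '\0';
}
```

Under these assumptions, compute_queue_bitmap(false) yields 0xFF (all of pipe 0) and compute_queue_bitmap(true) yields 0x03030303 (queues 0 and 1 of each pipe), which format_mec0() renders as 11111111 00000000 00000000 00000000 and 11000000 11000000 11000000 11000000 respectively. MEC 1 (bits 32 to 63) is left unclaimed in both cases.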