[Beignet] GROMACS on beignet
Pan, Xiuli
xiuli.pan at intel.com
Mon Apr 11 06:34:30 UTC 2016
Hi Szilárd,
I have ran with the prefetch enable and I found all platform will show the same problem. And the problem is caused by gpu hand, and the hang is caused by the
barrier(CLK_LOCAL_MEM_FENCE);
you added inside the loop. In OpenCL spec it is required that all work-items should all run the barrier. So In your case it seems the barrier will not be executed by all of the work-item by the same times.
You can try to move the barrier outside the loop or make sure all the loop will execute the barrier same times.
Thanks
Xiuli
-----Original Message-----
From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf Of Szilárd Páll
Sent: Thursday, April 7, 2016 8:57 PM
To: Pan, Xiuli <xiuli.pan at intel.com>
Cc: beignet at lists.freedesktop.org
Subject: Re: [Beignet] GROMACS on beignet
Hi,
On Wed, Apr 6, 2016 at 5:53 AM, Pan, Xiuli <xiuli.pan at intel.com> wrote:
> Hi Szilard,
>
> The enqueuer seems to be our runtime implementation limitation, we are now refining our runtime.
Thanks for the confirmation! Note that we have CPU-side cycle-counter timers with statistics printed at the end of the log - the "Launch GPU" row should account for time spent in async OpenCL API calls, but here it looks like the total time measured there is larger than the time measured with OpenCL events (bottom table). Hence it seems that there there is no overlap at all between CPU and GPU.
>
> And about the double quote problem we had a patch but not in release_v1.1.1:
> https://cgit.freedesktop.org/beignet/commit/?h=Release_v1.1&id=8e9ef20
> f731d4135fc4866bcf7374c8222e21a25 It will be in release v1.1.2 that
> will soon be released.
Great, thanks!
> I have run the GROMACS on one of our HSW devices and it seems run soomthly and send you the logs in an early email. What else I need to do to reproduce the bugs?
Answered off-list!
Cheers,
--
Szilárd
> Thanks
> Xiuli
> -----Original Message-----
> From: Szilárd Páll [mailto:sin.pecado at gmail.com]
> Sent: Wednesday, April 6, 2016 7:35 AM
> To: Pan, Xiuli <xiuli.pan at intel.com>
> Cc: beignet at lists.freedesktop.org
> Subject: Re: [Beignet] GROMACS on beignet
>
> Hi Xiuli,
>
>
> On Tue, Apr 5, 2016 at 5:09 AM, Pan, Xiuli <xiuli.pan at intel.com> wrote:
>> Hi Szilárd,
>>
>>
>>
>> 1. It seems size is fine.
>
> Ack, thanks for checking.
>
>>
>> 2. The test seems ok. Could you provide your environment for the HSW
>> machine with problem, like kernel versions, drm versions etc.
>
> It's a vanilla Ubuntu 15.10 installation with 4.2.0-34-generic kernel,
> 2.4.64-1 libdrm.
>
>>
>> 3. For clEnqueueWriteBuffer, clEnqueuereadBuffer they are actually I/O
>> control to get map so total blocking on CPU,
>
> Does that an implementation limitation? I thought these can be async with respect to the CPU?
>
>> but for clEnqueueNDRangeKernel
>> you can enqueue it by pass an event to it then it will not have a flush.
>
> Admittedly I'm not sure what you mean here, could you please explain?
> We do pass an event as last argument, but only for timing purposes (by default on, but can be turned off).
>
>>
>>
>> I will try GROMACS on our platforms to see if bugs can be reproduced
>> and we can try to root case the bugs.
>
> Thanks.
>
> --
> Szilard
>
>>
>>
>>
>> Thanks
>>
>> Xiuli
>>
>>
>>
>> From: Szilárd Páll [mailto:sin.pecado at gmail.com]
>> Sent: Friday, April 1, 2016 6:46 PM
>> To: Pan, Xiuli <xiuli.pan at intel.com>
>> Cc: beignet at lists.freedesktop.org
>>
>>
>> Subject: Re: [Beignet] GROMACS on beignet
>>
>>
>>
>> Hi Xiuli,
>>
>>
>>
>> Apologies if I were not clear enough with my questions.
>>
>>
>>
>> 1. By saying that there is only 64K local memory, I assume you meant
>> to hint that this is a scarce resource. We use only ~2KB local memory
>> per work group required mostly for prefetching and reduction across
>> work-group. By disabling pre-fetching we could get this down to a
>> minimum of about 800 bytes. However, do you expect that to help in
>> any way? Is the hardware capable of keeping in flight >30-32 waves of 64 threads?
>>
>>
>>
>> Also, I think I'm lacking some detailed knowledge as I do not see how
>> is this related to the drm_intel_gem_bo_context_exec() issue.
>>
>>
>>
>> 2. As mentioned above, I have local work size = 64 and rely on
>> splitting the work over the global grid (so small workloads will have
>> 100s, large ones 10000s larger global work size).
>>
>>
>>
>> 3. Do you mean that I should *not* expect concurrency between CPU and
>> GPU to be possible with beignet and clEnqueueWriteBuffer,
>> clEnqueuereadBuffer, and clEnqueueNDRangeKernel will exhibit blocking behavior?
>>
>>
>>
>>
>>
>> I have not had time to file a bugzilla yet with reproduction details,
>> sorry about that.
>>
>> The GROMACS source you'll need is in the master branch, plus the
>> https://gerrit.gromacs.org/#/c/5752/2 change under review fixes some
>> execution width assumptions. Other than that you'll need a small
>> patch to enable Intel iGPUs (e.g found here
>> https://bugs.freedesktop.org/show_bug.cgi?id=94265 which is BTW
>> another bug on IVB). You'll also run into the include path issue I
>> mentioned before for which you'll need to activate the Apple workaround, here's a patch:
>>
>>
>>
>> diff --git a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
>> b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
>>
>> index 2084d8c..8928582 100644
>>
>> --- a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
>>
>> +++ b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
>>
>> @@ -131,6 +131,8 @@ static int is_gmx_supported_gpu_id(struct
>> gmx_device_info_t *ocl_gpu_device)
>>
>> return egpuCompatible;
>>
>> case OCL_VENDOR_AMD:
>>
>> return runningOnCompatibleOSForAmd() ? egpuCompatible :
>> egpuIncompatible;
>>
>> + case OCL_VENDOR_INTEL:
>>
>> + return egpuCompatible;
>>
>> default:
>>
>> return egpuIncompatible;
>>
>> }
>>
>> diff --git a/src/gromacs/gpu_utils/ocl_compiler.cpp
>> b/src/gromacs/gpu_utils/ocl_compiler.cpp
>>
>> index 6a4772a..9aa3c1e 100644
>>
>> --- a/src/gromacs/gpu_utils/ocl_compiler.cpp
>>
>> +++ b/src/gromacs/gpu_utils/ocl_compiler.cpp
>>
>> @@ -747,7 +747,8 @@ ocl_get_build_options_string(cl_context
>> context,
>>
>> * OpenCL implementations are happy with. Since the standard
>> still says
>>
>> * it should be quoted, we handle Apple as a special case.
>>
>> */
>>
>> -#ifdef __APPLE__
>>
>> +//#ifdef __APPLE__
>>
>> +#if 1
>>
>> std::string unescaped_ocl_root_path = get_ocl_root_path();
>>
>> std::string ocl_root_path;
>>
>>
>>
>>
>>
>> Additionally, here's an input file you'll need to be able to start
>> the
>> program:
>>
>> https://www.dropbox.com/s/hm5t90iwo3xw5ws/water-48k-frozen.tpr?dl=0
>>
>> which you can do with the following command:
>>
>> /PATH/gmx mdrun -s water-48k-frozen
>>
>>
>>
>> Let me know if something is unclear.
>>
>>
>>
>> Thanks for the help!
>>
>>
>>
>> Cheers,
>>
>>
>> --
>> Szilárd
>>
>>
>>
>> On Thu, Mar 31, 2016 at 5:34 AM, Pan, Xiuli <xiuli.pan at intel.com> wrote:
>>
>> Hi Szilárd,
>>
>>
>>
>> Since you have some questions and I could not reproduce them here I
>> just make some response that I think may related to this problem:
>>
>> 1. We only have 64K Share local memory for all of the work groups
>>
>> 2. The drm_intel_gem_bo_context_exec() failed have a lot of reasons,
>> could give us the detail about your test about the execution wide?
>>
>> 3. As far as I know most enqueue in beignet default to be
>> blocking(some related to GPU is not blocking) , you can see that api
>> clFlush is actually an empty function.
>>
>>
>>
>> Also I am trying to reproduce your bug here and I am setting up
>> GROMACS. Is there anything I should know to run it with beignet?
>>
>>
>>
>> Thanks
>>
>> Xiuli
>>
>>
>>
>> From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On
>> Behalf Of Szilárd Páll
>> Sent: Thursday, March 31, 2016 3:14 AM
>> To: beignet at lists.freedesktop.org
>> Subject: Re: [Beignet] GROMACS on beignet
>>
>>
>>
>> Hello again,
>>
>>
>>
>> I have been trying to verify whether there may be assumptions
>> >=32-wide execution hiding in the kernels (in particular in code
>> that's using local memory for prefetching or reduction) and tried
>> dropping in mem fences to test a few things, but at several points I
>> managed to trigger the aforementioned error:
>>
>> drm_intel_gem_bo_context_exec() failed: Input/output error
>>
>>
>>
>> Is this a known issues? There have been reports of it, but perhaps it
>> is just the manifestation of multiple possible issues?
>>
>>
>>
>> Secondly, I do not see the reason why I get blocking behavior of all
>> enqueue operations (and I don't get this on NVIDIA or AMD). Are there
>> any peculiarities I should be aware of?
>>
>>
>>
>> Cheers,
>>
>>
>> --
>> Szilárd
>>
>>
>>
>> On Mon, Mar 28, 2016 at 1:49 AM, Szilárd Páll <sin.pecado at gmail.com> wrote:
>>
>> Hi Xiuli,
>>
>>
>>
>> Thanks for the quick reply!
>>
>>
>>
>> On Fri, Mar 25, 2016 at 4:06 AM, Pan, Xiuli <xiuli.pan at intel.com> wrote:
>>
>> Hi Szilárd,
>>
>>
>>
>> What do you mean about quoted includes?
>>
>>
>>
>> I mean -I"/path/to/headers" does not work, but -I/path/to/headers does.
>>
>>
>>
>> If you mean the include in kernels, I think we may have some problem
>> with that. The *.cl we used for clang actually was a copied tmp
>> version stored not in where is used to be. So I think if you just put
>> what need to be included in the old place, clang could not find it.
>> You could try a workaround to pass “-I where/your/header/is” as a
>> build option to clBuildProgram.
>>
>>
>>
>> Then if you have some double types used on Haswell it may have some problem.
>> The hardware for HSW does not support double very well as we have
>> refined our double support to hardware then, so HSW may have some
>> issues with double type. If it is not the problem with double float,
>> you can send your kernel as an attachment or report a bug on our
>> Bugzilla(https://bugs.freedesktop.org) and we will tried to fix it.
>>
>>
>>
>> No double precision in the kernels.
>>
>>
>>
>> For now I'll post here, I feel like a bug report may be an overkill -
>> especially as I can't provide a full repro case that does not involve
>> building the entire application.
>>
>>
>>
>> I've attached a minimum set of source files that's needed to compile.
>> We have pretty heavy preprocessor use that generates kernels for the
>> different inputs / outputs / computation combinations, so one
>> particular flavor that's known to produce incorrect results is
>> generated compiling nbnxn_ocl_kernels.cl with the following flags:
>>
>>
>>
>> -D_WARPLESS_SOURCE_ -DGMX_OCL_FASTGEN -DEL_RF -DEELNAME=_ElecRF
>> -DLJ_COMB_GEOM -DVDWNAME=_VdwLJCombGeom -DCENTRAL=22
>> -DNBNXN_GPU_NCLUSTER_PER_SUPERCLUSTER=8 -DNBNXN_GPU_CLUSTER_SIZE=8
>> -DNBNXN_GPU_JGROUP_SIZE=4 -DNBNXN_AVOID_SING_R2_INC=1.0e-12f
>>
>>
>>
>>
>>
>> Additionally I had a closer look and so far I have observed three
>> issues (additional to the minor include issue mentione before):
>>
>>
>>
>> 1. If I do a manual prefetch into local memory followed by a mem
>> fence (seenbnxn_ocl_kernel_nowarp.clh line 339), I get the following error:
>>
>> drm_intel_gem_bo_context_exec() failed: Input/output error
>>
>> The next kernel call then fails with CL_OUT_OF_RESOURCES.
>>
>> Without the manual prefetch it works better, but...
>>
>>
>>
>> 2. The results produced by the kernel are still somewhat off. It
>> could be that I missed a subtle detail and the kernels still do not
>> conform to the hardware's execution model. I'm very familar with
>> Intel's hardware and these kernels were originally designed for 32/64 wide execution.
>>
>>
>>
>> 3. All task enqueue calls seem to be blocking.
>>
>>
>>
>>
>>
>> Thanks & Cheers,
>>
>> --
>>
>> Szilárd
>>
>>
>>
>>
>>
>> Thanks
>>
>> Xiuli
>>
>>
>>
>> From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On
>> Behalf Of Szilárd Páll
>> Sent: Friday, March 25, 2016 7:16 AM
>> To: beignet at lists.freedesktop.org
>> Subject: [Beignet] GROMACS on beignet
>>
>>
>>
>> Hi,
>>
>>
>>
>> I am a developer of the GROMACS (www.gromacs.org) molecular dynamics
>> simulation package. We have OpenCL offload for some of the
>> compute-intensive kernels which that works very well on AMD. I wanted
>> to assess how feasible is to use an Intel iGPU in GROMACS and after
>> jumping through some hoops I got a 4.2 kernel and beignet master installed.
>>
>>
>>
>> Then I ran into the first minor issue: it seems that beignet does not
>> accept quoted includes although AFAIK the double-quoted include paths
>> should be accepted, but that did not work. No big deal, it doesn't
>> work with Apple's OpenCL either, but I thought I'd ask.
>>
>>
>>
>> However, the bigger issue is that running on Haswell (HD 4600, I
>> think) the kernel produces results that are very off (while the very
>> same source gives correct results on other platforms). I've not much
>> time to dig deeper, but I thought I'd drop a mail maybe somebody is
>> interested in helping out with tips or even tracking down where the issue is.
>>
>>
>>
>> Suggestions would be welcome!
>>
>>
>>
>> Cheers,
>>
>> --
>> Szilárd
>>
>>
>>
>>
>>
>>
_______________________________________________
Beignet mailing list
Beignet at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/beignet
More information about the Beignet
mailing list