[Beignet] GROMACS on beignet

Fri Apr 1 10:45:31 UTC 2016

Hi Xiuli,

Apologies if I were not clear enough with my questions.

1. By saying that there is only 64K local memory, I assume you meant to
hint that this is a scarce resource. We use only ~2KB local memory per work
group required mostly for prefetching and reduction across work-group. By
disabling pre-fetching we could get this down to a minimum of about 800
bytes. However, do you expect that to help in any way? Is the hardware
capable of keeping in flight >30-32 waves of 64 threads?

Also, I think I'm lacking some detailed knowledge as I do not see how is
this related to the drm_intel_gem_bo_context_exec() issue.

2. As mentioned above, I have local work size = 64 and rely on splitting
the work over the global grid (so small workloads will have 100s, large
ones 10000s larger global work size).

3. Do you mean that I should *not* expect concurrency between CPU and GPU
to be possible with beignet and clEnqueueWriteBuffer, clEnqueuereadBuffer,
and clEnqueueNDRangeKernel will exhibit blocking behavior?


I have not had time to file a bugzilla yet with reproduction details, sorry
about that.
The GROMACS source you'll need is in the master branch, plus the
https://gerrit.gromacs.org/#/c/5752/2 change under review fixes some
execution width assumptions. Other than that you'll need a small patch to
enable Intel iGPUs (e.g found here
https://bugs.freedesktop.org/show_bug.cgi?id=94265 which is BTW another bug
on IVB). You'll also run into the include path issue I mentioned before for
which you'll need to activate the Apple workaround, here's a patch:

diff --git a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
index 2084d8c..8928582 100644
--- a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
+++ b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
@@ -131,6 +131,8 @@ static int is_gmx_supported_gpu_id(struct
gmx_device_info_t *ocl_gpu_device)
             return egpuCompatible;
         case OCL_VENDOR_AMD:
             return runningOnCompatibleOSForAmd() ? egpuCompatible :
egpuIncompatible;
+        case OCL_VENDOR_INTEL:
+            return egpuCompatible;
         default:
             return egpuIncompatible;
     }
diff --git a/src/gromacs/gpu_utils/ocl_compiler.cpp
b/src/gromacs/gpu_utils/ocl_compiler.cpp
index 6a4772a..9aa3c1e 100644
--- a/src/gromacs/gpu_utils/ocl_compiler.cpp
+++ b/src/gromacs/gpu_utils/ocl_compiler.cpp
@@ -747,7 +747,8 @@ ocl_get_build_options_string(cl_context
context,
          * OpenCL implementations are happy with. Since the standard still
says
          * it should be quoted, we handle Apple as a special case.
          */
-#ifdef __APPLE__
+//#ifdef __APPLE__
+#if 1
         std::string unescaped_ocl_root_path = get_ocl_root_path();
         std::string ocl_root_path;


Additionally, here's an input file you'll need to be able to start the
program:
https://www.dropbox.com/s/hm5t90iwo3xw5ws/water-48k-frozen.tpr?dl=0
which you can do with the following command:
/PATH/gmx mdrun -s water-48k-frozen

Let me know if something is unclear.

Thanks for the help!

Cheers,

--
Szilárd

On Thu, Mar 31, 2016 at 5:34 AM, Pan, Xiuli <xiuli.pan at intel.com> wrote:

> Hi Szilárd,
>
>
>
> Since you have some questions and I could not reproduce them here I just
> make some response that I think may related to this problem:
>
> 1.      We only have 64K Share local memory for all of the work groups
>
> 2.      The drm_intel_gem_bo_context_exec() failed have a lot of reasons,
> could give us the detail about your test about the execution wide?
>
> 3.      As far as I know most  enqueue in beignet default to be
> blocking(some related to GPU is not blocking) , you can see that api
> clFlush  is actually an empty function.
>
>
>
> Also I am trying to reproduce your bug here and I am setting up GROMACS.
> Is there anything I should know to run it with beignet?
>
>
>
> Thanks
>
> Xiuli
>
>
>
> *From:* Beignet [mailto:beignet-bounces at lists.freedesktop.org] *On Behalf
> Of *Szilárd Páll
> *Sent:* Thursday, March 31, 2016 3:14 AM
> *To:* beignet at lists.freedesktop.org
> *Subject:* Re: [Beignet] GROMACS on beignet
>
>
>
> Hello again,
>
>
>
> I have been trying to verify whether there may be assumptions >=32-wide
> execution hiding in the kernels (in particular in code that's using local
> memory for prefetching or reduction) and tried dropping in mem fences to
> test a few things, but at several points I managed to trigger the
> aforementioned error:
>
> drm_intel_gem_bo_context_exec() failed: Input/output error
>
>
>
> Is this a known issues? There have been reports of it, but perhaps it is
> just the manifestation of multiple possible issues?
>
>
>
> Secondly, I do not see the reason why I get blocking behavior of all
> enqueue operations (and I don't get this on NVIDIA or AMD). Are there any
> peculiarities I should be aware of?
>
>
>
> Cheers,
>
>
> --
> Szilárd
>
>
>
> On Mon, Mar 28, 2016 at 1:49 AM, Szilárd Páll <sin.pecado at gmail.com>
> wrote:
>
> Hi Xiuli,
>
>
>
> Thanks for the quick reply!
>
>
>
> On Fri, Mar 25, 2016 at 4:06 AM, Pan, Xiuli <xiuli.pan at intel.com> wrote:
>
> Hi Szilárd,
>
>
>
> What do you mean about quoted includes?
>
>
>
> I mean -I"/path/to/headers" does not work, but  -I/path/to/headers does.
>
>
>
> If you mean the include in kernels, I think we may have some problem with
> that. The *.cl we used for clang actually was a copied tmp version stored
> not in where is used to be. So I think if you just put what need to be
> included in the old place, clang could not find it. You could try a
> workaround to pass “-I where/your/header/is”  as a build option to
> clBuildProgram.
>
>
>
> Then if you have some double types used on Haswell it may have some
> problem. The hardware for HSW does not support double very well as we have
> refined our double support to hardware then, so HSW may have some issues
> with double type. If it is not the problem with double float, you can send
> your kernel as an attachment or report a bug on our Bugzilla(
> https://bugs.freedesktop.org) and we will tried to fix it.
>
>
>
> No double precision in the kernels.
>
>
>
> For now I'll post here, I feel like a bug report may be an overkill -
> especially as I can't provide a full repro case that does not involve
> building the entire application.
>
>
>
> I've attached a minimum set of source files that's needed to compile. We
> have pretty heavy preprocessor use that generates kernels for the different
> inputs / outputs / computation combinations, so one particular flavor
> that's known to produce incorrect results is generated compiling
> nbnxn_ocl_kernels.cl with the following flags:
>
>
>
> -D_WARPLESS_SOURCE_ -DGMX_OCL_FASTGEN -DEL_RF -DEELNAME=_ElecRF
> -DLJ_COMB_GEOM -DVDWNAME=_VdwLJCombGeom -DCENTRAL=22
> -DNBNXN_GPU_NCLUSTER_PER_SUPERCLUSTER=8 -DNBNXN_GPU_CLUSTER_SIZE=8
> -DNBNXN_GPU_JGROUP_SIZE=4 -DNBNXN_AVOID_SING_R2_INC=1.0e-12f
>
>
>
>
>
> Additionally I had a closer look and so far I have observed three issues
> (additional to the minor include issue mentione before):
>
>
>
> 1. If I do a manual prefetch into local memory followed by a mem fence
> (seenbnxn_ocl_kernel_nowarp.clh line 339), I get the following error:
>
> drm_intel_gem_bo_context_exec() failed: Input/output error
>
> The next kernel call then fails with CL_OUT_OF_RESOURCES.
>
> Without the manual prefetch it works better, but...
>
>
>
> 2. The results produced by the kernel are still somewhat off. It could be
> that I missed a subtle detail and the kernels still do not conform to the
> hardware's execution model. I'm very familar with Intel's hardware and
> these kernels were originally designed for 32/64 wide execution.
>
>
>
> 3. All task enqueue calls seem to be blocking.
>
>
>
>
>
> Thanks & Cheers,
>
> --
>
> Szilárd
>
>
>
>
>
> Thanks
>
> Xiuli
>
>
>
> *From:* Beignet [mailto:beignet-bounces at lists.freedesktop.org] *On Behalf
> Of *Szilárd Páll
> *Sent:* Friday, March 25, 2016 7:16 AM
> *To:* beignet at lists.freedesktop.org
> *Subject:* [Beignet] GROMACS on beignet
>
>
>
> Hi,
>
>
>
> I am a developer of the GROMACS (www.gromacs.org) molecular dynamics
> simulation package. We have OpenCL offload for some of the
> compute-intensive kernels which that works very well on AMD. I wanted to
> assess how feasible is to use an Intel iGPU in GROMACS and after jumping
> through some hoops I got a 4.2 kernel and beignet master installed.
>
>
>
> Then I ran into the first minor issue: it seems that beignet does not
> accept quoted includes although AFAIK the double-quoted include paths
> should be accepted, but that did not work. No big deal, it doesn't work
> with Apple's OpenCL either, but I thought I'd ask.
>
>
>
> However, the bigger issue is that running on Haswell (HD 4600, I think)
> the kernel produces results that are very off (while the very same source
> gives correct results on other platforms). I've not much time to dig
> deeper, but I thought I'd drop a mail maybe somebody is interested in
> helping out with tips or even tracking down where the issue is.
>
>
>
> Suggestions would be welcome!
>
>
>
> Cheers,
>
> --
> Szilárd
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/beignet/attachments/20160401/32ed15e0/attachment-0001.html>