[Beignet] CL_MEM_USE_HOST_PTR involve extra copy?

Tue Dec 2 21:57:36 PST 2014

On Wed, Dec 03, 2014 at 01:46:51PM +0800, spring_wind wrote:
> 
> 
> I noticed that in kernel's source(3.12.13) drm_clflush_page used clflush instruction.
> I wonder whether the clfush instruction only push data to LLC or RAM? If it is RAM, is it needless since CPU and GPU share LLC.

Right, they share the LLC, but doesn't share other level of caches,
so when swtich from CPU domain to GPU domain, CPU needs to flush CPU's
L1/L2 cache to LLC cache. When switch from GPU domain to CPU domain,
GPU needs to flush GPU's L3 cache to LLC cache.

Please be noted, the L3 cache I mentioned is in GPU domain.
> 
> We are planning to use beignet to accelerate video analysis and image processing. The data flow will involve input and output between GPU and CPU, we expect it will be not a bottleneck.
Although cache flush between L3 <--> LLC <--> L2/L1 is not as heavy
as the flush between LLC and memory, it's still not free. Especially
if your application need to do that domain switch freqently.

> 
> 
> 
> At 2014-12-03 10:36:28, "Zhigang Gong" <zhigang.gong at linux.intel.com> wrote:
> >On Wed, Dec 03, 2014 at 10:44:26AM +0800, spring_wind wrote:
> >> 1. clCreateBuffer(CL_MEM_USE_HOST_PTR)
> >> 2. clMapxx-->change the data pointed by host ptr--->clUnmapxxx
> >> 
> >> Is that right?
> >The usual use scenario is as below:
> >
> >1. clCreateBuffer(CL_MEM_USE_HOST_PTR)
> >2. clEnqueueNDRangeKernel() to enqueue some kernel to do some
> >   computation on GPU side and write to the above buffer.
> >3. clEnqueueMapBuffer() the buffer, and do some computation
> >   on CPU side and modify the buffer content pointed by host_ptr.
> >4. clEnqueueUnmapBuffer() to give the buffer back to GPU.
> >
> >The step 3 and 4 are synchronization points defined by OpenCL
> >spec which are mandatory on all platforms to access a buffer
> >pointed by host_ptr shared by GPU and CPU. At step3, not only
> >a GPU cache flush (flush from GPU's L3 cache to CPU's LLC cache),
> >but also need to wait for all the kernel enqueued in step 2
> >to be finished.
> >
> >
> >> 
> >> I think clUnmapxx must contain cache flush operation although it may not involve extra copy, correct me if am wrong.
> >
> >You are right.
> >The cache flush is always a must when we switch between GPU
> >domain and CPU domain. But you can see, the above flush is
> >not so heavy, as it's not a flush between cache and slow
> >memory.
> >
> >> I am very concerned about the beignet performance on our product.
> >
> >You are welcome to share your detail concerns with us if possible.
> >And there are also some performance hint in the 'docs/optimization-guide.mdwn'.
> >We will continue to update that document.
> >
> >Thanks,
> >Zhigang Gong.
> >
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 在 2014-12-03 10:14:10，"Guo, Yejun" <yejun.guo at intel.com> 写道：
> >> 
> >> 
> >> Just need to invoke clMapxx function before you access the host ptr from CPU side, and invoke clUnmapxx function after the CPU access. There is no extra copy inside clMap/Unmapxx if all the conditions are satisfied. You can refer to the code in runtime_use_host_ptr_buffer.cpp under beignet/utests.
> >> 
> >>  
> >> 
> >>  
> >> 
> >> From: spring_wind [mailto:spring_wind at yeah.net]
> >> Sent: Wednesday, December 03, 2014 10:02 AM
> >> To: Guo, Yejun
> >> Cc:beignet at lists.freedesktop.org
> >> Subject: Re:RE: [Beignet] CL_MEM_USE_HOST_PTR involve extra copy?
> >> 
> >>  
> >> 
> >> You mean if I use CL_MEM_USE_HOST_PTR and host ptr is page aligned, I change the data pointed by host ptr from CPU side, I don't have to do any flush opertaion or I should still call clMapxx function?
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 在 2014-12-03 08:37:15，"Guo, Yejun" <yejun.guo at intel.com> 写道：
> >> 
> >> 
> >> 
> >> Hi,
> >> 
> >>  
> >> 
> >> please check the latest code of beignet, there is no copy needed between CPU and GPU if the host_ptr provided by application is page aligned, and the page align limitation is expected to be removed some days later. You can also try CL_MEM_ALLOC_HOST_PTR to avoid the extra copy without align limitation.
> >> 
> >>  
> >> 
> >> Btw, this zero-copy is supported starting from linux kernel 3.16 and libdrm 2.4.58.
> >> 
> >>  
> >> 
> >>  
> >> 
> >> From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf Of spring_wind
> >> Sent: Tuesday, December 02, 2014 8:38 PM
> >> To:beignet at lists.freedesktop.org
> >> Subject: [Beignet] CL_MEM_USE_HOST_PTR involve extra copy?
> >> 
> >>  
> >> 
> >> Hi:
> >> 
> >> Intel opencl optimization guide said using CL_MEM_USE_HOST_PTR can avoid extra copy between CPU and GPU, but I noticed that in beignet's implementation source it was not like that:
> >> 
> >> /* Copy the data if required */
> >> 
> >>   if (flags & CL_MEM_COPY_HOST_PTR || flags & CL_MEM_USE_HOST_PTR)
> >> 
> >>     cl_buffer_subdata(mem->bo, 0, sz, data);
> >> 
> >>  
> >> 
> >> Could someone give me an answer?
> >> 
> >>  
> >> 
> >>  
> >
> >> _______________________________________________
> >> Beignet mailing list
> >> Beignet at lists.freedesktop.org
> >> http://lists.freedesktop.org/mailman/listinfo/beignet
> >
> >_______________________________________________
> >Beignet mailing list
> >Beignet at lists.freedesktop.org
> >http://lists.freedesktop.org/mailman/listinfo/beignet