[Beignet] johntheripper/OpenCL clGetEventProfilingInfo issue

Fri Oct 24 23:57:12 PDT 2014

Could you try the latest git master of beignet? We have made some major
performance improvements for some cases.
I'm not sure whether they apply to JohnTheRipper, but it is worth a try.

Thanks,
Zhigang Gong.

On Sat, Oct 25, 2014 at 2:46 PM, Oleksii Shevchuk
<public.avatar at gmail.com> wrote:
> Zhigang Gong <zhigang.gong at gmail.com> writes:
>
>> This should be an application bug, according to the OpenCL 1.2 spec:
>>
>>   CL_PROFILING_INFO_NOT_AVAILABLE if the CL_QUEUE_PROFILING_ENABLE
>>   flag is not set for the command-queue, if the execution status of
>>   the command identified by event is not CL_COMPLETE or if event is
>>   a user event object.
>>
>> To make sure an event's state is CL_COMPLETE, you need to call
>> clWaitForEvents() rather than clFinish().
>>
>> According to the spec, clFinish():
>>   blocks until all previously queued OpenCL commands in command_queue are issued
>>   to the associated device and have completed.
>>
>> It is not meant to update the state of all the related events. It is
>> also too heavy, as it waits for the commands to complete. The event's
>> CL_COMPLETE state means the command has been flushed into the GPU's
>> command buffer and may not have completed yet. It is used for
>> synchronization on the GPU command queue side, whereas clFinish()
>> synchronizes with the host CPU.
>>
>> I would recommend calling clWaitForEvents() before you call
>> clGetEventProfilingInfo().
>> If you still have problems after that change, please let us know.
>>
>
> Thanks, it works. Slow, but it works. Maybe this is a problem with
> their implementation.
>
> Btw, I call clWaitForEvents for one event in the list every time before
> calling clGetEventProfilingInfo on that event. Is that OK, or should I
> call it for the whole event list?
>
> Also, I use the i915 driver with the following args:
> i915.modeset=1 i915.i915_enable_rc6=1 i915.i915_enable_fbc=1
> i915.lvds_downclock=1
>
> They shouldn't influence the speed, should they?
>
> // Some bench output:
>
> magnumripper_JohnTheRipper > run/john -format=Raw-MD5-opencl -te
> Device 0: Intel(R) HD Graphics IvyBridge M GT2
> Local worksize (LWS) 16, global worksize (GWS) 1048576
> Benchmarking: Raw-MD5-opencl [MD5 OpenCL (inefficient, development use only)]... DONE
> Raw:    30107K c/s real, 84468K c/s virtual
>
> magnumripper_JohnTheRipper > run/john -format=Raw-MD5 -te
> Will run 4 OpenMP threads
> Benchmarking: Raw-MD5 [MD5 128/128 AVX 12x]... (4xOMP) DONE
> Raw:    40206K c/s real, 10497K c/s virtual
>
> magnumripper_JohnTheRipper > run/john -format=ecnfs -te
> Unknown ciphertext format name requested
> magnumripper_JohnTheRipper > run/john -format=encfs -te
> Will run 4 OpenMP threads
> Benchmarking: EncFS [PBKDF2-SHA1 AES/Blowfish 8x SSE2]... (4xOMP) DONE
> Raw:    62.1 c/s real, 16.4 c/s virtual
>
> magnumripper_JohnTheRipper > run/john -format=encfs-opencl -te
> Will run 4 OpenMP threads
> Device 0: Intel(R) HD Graphics IvyBridge M GT2
> Local worksize (LWS) 64, global worksize (GWS) 64
> Benchmarking: encfs-opencl, EncFS [PBKDF2-SHA1 OpenCL 4x AES/Blowfish]... (4xOMP) DONE
> Raw:    7.8 c/s real, 4266 c/s virtual
>
> magnumripper_JohnTheRipper > run/john -format=PBKDF2-HMAC-SHA1-opencl -te
> Device 0: Intel(R) HD Graphics IvyBridge M GT2
> Local worksize (LWS) 64, global worksize (GWS) 8192
> Benchmarking: PBKDF2-HMAC-SHA1-opencl [PBKDF2-SHA1 OpenCL 4x]... DONE
> Raw:    12459 c/s real, 3276K c/s virtual
>
> magnumripper_JohnTheRipper > run/john -format=PBKDF2-HMAC-SHA1 -te
> Will run 4 OpenMP threads
> Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 8x SSE2]... (4xOMP) DONE
> Raw:    16062 c/s real, 5957 c/s virtual
>
> Thanks.
>
> // wbr
> // alxchk