[Beignet] [PATCH] Refine the cl thread implement for queue.

Mon May 26 00:04:03 PDT 2014

I think the batch buffer is just one kind of resource.
The other resources such as Image buffers, data buffers are also
needed to keep intact for each thread.
Unless you add a lock at the NDRQueue entrance, which begin to prepare
all the required resources. But I think that lock should be too heavy.

On Mon, 2014-05-26 at 09:37 +0800, Zhigang Gong wrote:
> And I just checked the clFinish and clFlush, they only need to access the
> batch buffer.
> So the root cause is that we always allocate a new batch buffer for a new
> kernel submitting
> for a queue. Even if there are many kernel enqueuing on the same queue.
> 
> If we can maintain a uniform batch buffer for the single queue, then this
> issue will be
> solved clearly and gracefully.
> 
> IMO, this is not the OpenCL spec issue. This is a implementation issue which
> we should solved
> In the future. What's your opinion?
> 
> BTW, I'm ok with current implementation. But I found you may missed some
> minor comments
> Which embedded in my first email. Could you recheck it and solve those
> comment and
> Send a new version of the patch?
> 
> > -----Original Message-----
> > From: Zhigang Gong [mailto:zhigang.gong at linux.intel.com]
> > Sent: Monday, May 26, 2014 9:31 AM
> > To: 'He Junyan'
> > Cc: 'Junyan He'; 'beignet at lists.freedesktop.org'
> > Subject: RE: [Beignet] [PATCH] Refine the cl thread implement for queue.
> > 
> > Ok. The key issue is that the private gpgpu data structure is still needed
> after
> > each kernel execution.
> > And the gpgpu data is different for each kernel execution, right?
> > Could you list all of the scenarios where we need to use the gpgpu data
> after
> > the kernel submitting?
> > I can think of the following two:
> > 1. clFinish
> > 2. clFlush
> > 
> > Is there any other cases?
> > 
> > > -----Original Message-----
> > > From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf
> > > Of He Junyan
> > > Sent: Friday, May 23, 2014 11:36 PM
> > > To: Zhigang Gong
> > > Cc: Junyan He; beignet at lists.freedesktop.org
> > > Subject: Re: [Beignet] [PATCH] Refine the cl thread implement for queue.
> > >
> > > I think it is hard to avoid using thread local data.
> > > Because when the queue creating, we do not know how many threads will
> > > use this queue later.
> > > The GPGPU resources will be hold in the queue, but at least every
> > > thread should have a local data to store the index to find the GPGPU
> data in
> > the queue.
> > > And the thread should need to destroy the GPGPU resource when the
> > > thread exit, while the queue's life time may be much longer than the
> thread.
> > >
> > > OpenCL spec fail to define the relationship between the queue and the
> > threads.
> > > This cause the dilemma.
> > >
> > > Please give some good advices if any.
> > >
> > > On Fri, 2014-05-23 at 16:33 +0800, Zhigang Gong wrote:
> > > > Some minor comments as below.
> > > >
> > > > One thought about the usage of thread local data we are using here.
> > > > The original reason why we want to use thread local data is to avoid
> > > > lock as much as possible. But finally, we found to satisfy all the
> > > > use scenario, we can't avoid lock any way. Now we introduce lock
> > > > eventually. Then is there still good reason why we should use these
> > > > thread local data any more?
> > > >
> > > > One possible question is as below:
> > > >
> > > > If one queue is used in another thread to enqueue task, does it make
> > > > sense to create a thread local new gpgpu data and in this thread. Or
> > > > we can just simply lock and wait for other thread to unlock the queue?
> > > >
> > > > On Tue, May 20, 2014 at 02:26:47PM +0800, junyan.he at inbox.com wrote:
> > > > > From: Junyan He <junyan.he at linux.intel.com>
> > > > >
> > > > > Because the cl_command_queue can be used in several threads
> > > > > simultaneously but without add ref to it, we now handle it like
> this:
> > > > > Keep one threads_slot_array, every time the thread get gpgpu or
> > > > > batch buffer, if it does not have a slot, assign it.
> > > > > The resources are keeped in queue private, and resize it if needed.
> > > > > When the thread exit, the slot will be set invalid.
> > > > > When queue released, all the resources will be released. If user
> > > > > still enqueue, flush or finish the queue after it has been
> > > > > released, the
> > > behavior is undefined.
> > > > > TODO: Need to shrink the slot map.
> > > > >
> > > > > Signed-off-by: Junyan He <junyan.he at linux.intel.com>
> > > > > ---
> > > > >  src/cl_command_queue.c      |   6 +-
> > > > >  src/cl_command_queue_gen7.c |   2 +-
> > > > >  src/cl_context.c            |   2 +-
> > > > >  src/cl_device_id.c          |   2 +-
> > > > >  src/cl_thread.c             | 261
> > > +++++++++++++++++++++++++++++++++-----------
> > > > >  src/cl_thread.h             |   6 +-
> > > > >  6 files changed, 205 insertions(+), 74 deletions(-)
> > > > >
> > > > > diff --git a/src/cl_command_queue.c b/src/cl_command_queue.c index
> > > > > 6a699c0..802d313 100644
> > > > > --- a/src/cl_command_queue.c
> > > > > +++ b/src/cl_command_queue.c
> > > > > @@ -60,6 +60,7 @@ cl_command_queue_new(cl_context ctx)
> > > > >    /* The queue also belongs to its context */
> > > > >    cl_context_add_ref(ctx);
> > > > >
> > > > > +
> > > > useless new line.
> > > > >  exit:
> > > > >    return queue;
> > > > >  error:
> > > > > @@ -74,6 +75,7 @@ cl_command_queue_delete(cl_command_queue
> > > queue)
> > > > >    assert(queue);
> > > > >    if (atomic_dec(&queue->ref_n) != 1) return;
> > > > >
> > > > > +  if (queue->ref_n < 0)
> > > > weird if statement here? What's the purpose?
> > > > >    /* Remove it from the list */
> > > > >    assert(queue->ctx);
> > > > >    pthread_mutex_lock(&queue->ctx->queue_lock);
> > > > > @@ -89,7 +91,7 @@ cl_command_queue_delete(cl_command_queue
> > > queue)
> > > > >      queue->fulsim_out = NULL;
> > > > >    }
> > > > >
> > > > > -  cl_thread_data_destroy(queue->thread_data);
> > > > > +  cl_thread_data_destroy(queue);
> > > > >    queue->thread_data = NULL;
> > > > >    cl_mem_delete(queue->perf);
> > > > >    cl_context_delete(queue->ctx);
> > > > > @@ -430,7 +432,7 @@ cl_command_queue_flush(cl_command_queue
> > > queue)
> > > > > LOCAL cl_int  cl_command_queue_finish(cl_command_queue queue)  {
> > > > > -  cl_gpgpu_sync(cl_get_thread_batch_buf());
> > > > > +  cl_gpgpu_sync(cl_get_thread_batch_buf(queue));
> > > > >    return CL_SUCCESS;
> > > > >  }
> > > > >
> > > > > diff --git a/src/cl_command_queue_gen7.c
> > > > > b/src/cl_command_queue_gen7.c index 891d6f1..d875021 100644
> > > > > --- a/src/cl_command_queue_gen7.c
> > > > > +++ b/src/cl_command_queue_gen7.c
> > > > > @@ -327,7 +327,7 @@
> > > cl_command_queue_ND_range_gen7(cl_command_queue queue,
> > > > >    /* Start a new batch buffer */
> > > > >    batch_sz = cl_kernel_compute_batch_sz(ker);
> > > > >    cl_gpgpu_batch_reset(gpgpu, batch_sz);
> > > > > -  cl_set_thread_batch_buf(cl_gpgpu_ref_batch_buf(gpgpu));
> > > > > +  cl_set_thread_batch_buf(queue, cl_gpgpu_ref_batch_buf(gpgpu));
> > > > >    cl_gpgpu_batch_start(gpgpu);
> > > > >
> > > > >    /* Issue the GPGPU_WALKER command */ diff --git
> > > > > a/src/cl_context.c b/src/cl_context.c index 8190e6a..452afb8
> > > > > 100644
> > > > > --- a/src/cl_context.c
> > > > > +++ b/src/cl_context.c
> > > > > @@ -203,7 +203,7 @@ cl_context_delete(cl_context ctx)
> > > > >    assert(ctx->buffers == NULL);
> > > > >    assert(ctx->drv);
> > > > >    cl_free(ctx->prop_user);
> > > > > -  cl_set_thread_batch_buf(NULL);
> > > > > +//  cl_set_thread_batch_buf(NULL);
> > > > why not just remove the useless code?
> > > >
> > > > >    cl_driver_delete(ctx->drv);
> > > > >    ctx->magic = CL_MAGIC_DEAD_HEADER; /* For safety */
> > > > >    cl_free(ctx);
> > > > > diff --git a/src/cl_device_id.c b/src/cl_device_id.c index
> > > > > 268789a..15a776d 100644
> > > > > --- a/src/cl_device_id.c
> > > > > +++ b/src/cl_device_id.c
> > > > > @@ -106,7 +106,7 @@ LOCAL cl_device_id
> > > > >  cl_get_gt_device(void)
> > > > >  {
> > > > >    cl_device_id ret = NULL;
> > > > > -  cl_set_thread_batch_buf(NULL);
> > > > > + // cl_set_thread_batch_buf(NULL);
> > > > ditto?
> > > >
> > > > >    const int device_id = cl_driver_get_device_id();
> > > > >    cl_device_id device = NULL;
> > > > >
> > > > > diff --git a/src/cl_thread.c b/src/cl_thread.c index
> > > > > cadc3cd..1b517ac 100644
> > > > > --- a/src/cl_thread.c
> > > > > +++ b/src/cl_thread.c
> > > > > @@ -15,113 +15,242 @@
> > > > >   * License along with this library. If not, see
> > > <http://www.gnu.org/licenses/>.
> > > > >   *
> > > > >   */
> > > > > +#include <string.h>
> > > > > +#include <stdio.h>
> > > > >
> > > > >  #include "cl_thread.h"
> > > > >  #include "cl_alloc.h"
> > > > >  #include "cl_utils.h"
> > > > >
> > > > > -static __thread void* thread_batch_buf = NULL;
> > > > > +/* Because the cl_command_queue can be used in several threads
> > > simultaneously but
> > > > > +   without add ref to it, we now handle it like this:
> > > > > +   Keep one threads_slot_array, every time the thread get gpgpu
> > > > > + or
> > > batch buffer, if it
> > > > > +   does not have a slot, assign it.
> > > > > +   The resources are keeped in queue private, and resize it if
> needed.
> > > > > +   When the thread exit, the slot will be set invalid.
> > > > > +   When queue released, all the resources will be released. If
> > > > > + user still
> > > enqueue, flush
> > > > > +   or finish the queue after it has been released, the behavior
> > > > > + is
> > > undefined.
> > > > > +   TODO: Need to shrink the slot map.
> > > > > +   */
> > > > >
> > > > > -typedef struct _cl_thread_spec_data {
> > > > > +static int thread_array_num = 1;
> > > > > +static int *thread_slot_map = NULL; static int thread_magic_num =
> > > > > +1; static pthread_mutex_t thread_queue_map_lock =
> > > > > +PTHREAD_MUTEX_INITIALIZER; static pthread_key_t destroy_key;
> > > > > +
> > > > > +static __thread int thread_id = -1; static __thread int
> > > > > +thread_magic = -1;
> > > > > +
> > > > > +typedef struct _thread_spec_data {
> > > > >    cl_gpgpu gpgpu ;
> > > > >    int valid;
> > > > > -}cl_thread_spec_data;
> > > > > +  void* thread_batch_buf;
> > > > > +  int thread_magic;
> > > > > +} thread_spec_data;
> > > > >
> > > > > -void cl_set_thread_batch_buf(void* buf) {
> > > > > -  if (thread_batch_buf) {
> > > > > -    cl_gpgpu_unref_batch_buf(thread_batch_buf);
> > > > > -  }
> > > > > -  thread_batch_buf = buf;
> > > > > -}
> > > > > +typedef struct _queue_thread_private {
> > > > > +  thread_spec_data**  threads_data;
> > > > > +  int threads_data_num;
> > > > > +  pthread_mutex_t thread_data_lock; } queue_thread_private;
> > > > >
> > > > > -void* cl_get_thread_batch_buf(void) {
> > > > > -  return thread_batch_buf;
> > > > > +static void thread_data_destructor(void *dummy) {
> > > > > +  pthread_mutex_lock(&thread_queue_map_lock);
> > > > > +  thread_slot_map[thread_id] = 0;
> > > > > +  pthread_mutex_unlock(&thread_queue_map_lock);
> > > > > +  free(dummy);
> > > > >  }
> > > > >
> > > > > -cl_gpgpu cl_get_thread_gpgpu(cl_command_queue queue)
> > > > > +static thread_spec_data *
> > > > > +__create_thread_spec_data(cl_command_queue queue, int create)
> > > > >  {
> > > > > -  pthread_key_t* key = queue->thread_data;
> > > > > -  cl_thread_spec_data* thread_spec_data =
> > > > > pthread_getspecific(*key);
> > > > > -
> > > > > -  if (!thread_spec_data) {
> > > > > -    TRY_ALLOC_NO_ERR(thread_spec_data, CALLOC(struct
> > > _cl_thread_spec_data));
> > > > > -    if (pthread_setspecific(*key, thread_spec_data)) {
> > > > > -      cl_free(thread_spec_data);
> > > > > -      return NULL;
> > > > > +  queue_thread_private *thread_private = ((queue_thread_private
> > > > > + *)(queue->thread_data));
> > > > > +  thread_spec_data* spec = NULL;
> > > > > +  int i = 0;
> > > > > +  int *old;
> > > > > +
> > > > > +  if (thread_id == -1) {
> > > > > +    void * dummy = malloc(sizeof(int));
> > > > > +
> > > > > +    pthread_mutex_lock(&thread_queue_map_lock);
> > > > > +    for (i = 0; i < thread_array_num; i++) {
> > > > > +      if (thread_slot_map[i] == 0) {
> > > > > +        thread_id = i;
> > > > > +        break;
> > > > > +      }
> > > > > +    }
> > > > > +
> > > > > +    if (i == thread_array_num) {
> > > > > +      thread_array_num *= 2;
> > > > > +      old = thread_slot_map;
> > > > > +      thread_slot_map = malloc(sizeof(int) * thread_array_num);
> > > > > +      memset(thread_slot_map, 0, (sizeof(int) * thread_array_num));
> > > > > +      for (i = 0; i < thread_array_num/2; i++) {
> > > > > +        thread_slot_map[i] = old[i];
> > > > > +      }
> > > > > +
> > > > > +      thread_id = thread_array_num/2;
> > > > > +      cl_free(old);
> > > > why not just use a realloc() here?
> > > > > +    }
> > > > > +
> > > > > +    thread_slot_map[thread_id] = 1;
> > > > > +
> > > > > +    thread_magic = thread_magic_num++;
> > > > > +    pthread_mutex_unlock(&thread_queue_map_lock);
> > > > > +
> > > > > +    pthread_setspecific(destroy_key, dummy);  }
> > > > > +
> > > > > +  pthread_mutex_lock(&thread_private->thread_data_lock);
> > > > > +  if (thread_array_num > thread_private->threads_data_num) {//
> > > > > + just
> > > enlarge
> > > > > +    thread_spec_data ** old = thread_private->threads_data;
> > > > > +    int old_num = thread_private->threads_data_num;
> > > > > +    thread_private->threads_data_num = thread_array_num;
> > > > > +    thread_private->threads_data =
> > > malloc(thread_private->threads_data_num * sizeof(void *));
> > > > > +    memset(thread_private->threads_data, 0, sizeof(void*) *
> > > thread_private->threads_data_num);
> > > > > +    for (i = 0; i < old_num; i++) {
> > > > > +      thread_private->threads_data[i] = old[i];
> > > > >      }
> > > > > +    free(old);
> > > >       Another place which may better to use realloc;
> > > > >    }
> > > > >
> > > > > -  if (!thread_spec_data->valid) {
> > > > > -    TRY_ALLOC_NO_ERR(thread_spec_data->gpgpu,
> > > cl_gpgpu_new(queue->ctx->drv));
> > > > > -    thread_spec_data->valid = 1;
> > > > > +  assert(thread_id != -1 && thread_id < thread_array_num);
> > > > > +  spec = thread_private->threads_data[thread_id];
> > > > > +  if (!spec && create) {
> > > > > +	spec = CALLOC(thread_spec_data);
> > > > > +	spec->thread_magic = thread_magic;
> > > > > +	thread_private->threads_data[thread_id] = spec;
> > > > >    }
> > > > >
> > > > > -error:
> > > > > -  return thread_spec_data->gpgpu;
> > > > > +  pthread_mutex_unlock(&thread_private->thread_data_lock);
> > > > > +
> > > > > +  return spec;
> > > > >  }
> > > > >
> > > > > -void cl_invalid_thread_gpgpu(cl_command_queue queue)
> > > > > +void* cl_thread_data_create(void)
> > > > >  {
> > > > > -  pthread_key_t* key = queue->thread_data;
> > > > > -  cl_thread_spec_data* thread_spec_data =
> > > > > pthread_getspecific(*key);
> > > > > +  queue_thread_private* thread_private =
> > > > > + CALLOC(queue_thread_private);
> > > > >
> > > > > -  if (!thread_spec_data) {
> > > > > -    return;
> > > > > -  }
> > > > > +  if (thread_private == NULL)
> > > > > +    return NULL;
> > > > >
> > > > > -  if (!thread_spec_data->valid) {
> > > > > -    return;
> > > > > +  if (thread_slot_map == NULL) {
> > > > > +    pthread_mutex_lock(&thread_queue_map_lock);
> > > > > +    thread_slot_map = malloc(sizeof(int) * thread_array_num);
> > > > > +    memset(thread_slot_map, 0, (sizeof(int) * thread_array_num));
> > > >        calloc = malloc + memset to zero.
> > > > > +    pthread_mutex_unlock(&thread_queue_map_lock);
> > > > > +
> > > > > +    pthread_key_create(&destroy_key, thread_data_destructor);
> > > > >    }
> > > > >
> > > > > -  assert(thread_spec_data->gpgpu);
> > > > > -  cl_gpgpu_delete(thread_spec_data->gpgpu);
> > > > > -  thread_spec_data->valid = 0;
> > > > > +  pthread_mutex_init(&thread_private->thread_data_lock, NULL);
> > > > > +
> > > > > +  pthread_mutex_lock(&thread_private->thread_data_lock);
> > > > > +  thread_private->threads_data = malloc(thread_array_num *
> > > > > + sizeof(void *));  memset(thread_private->threads_data, 0,
> > > > > + sizeof(void*) * thread_array_num);
> > > > > + thread_private->threads_data_num = thread_array_num;
> > > > > + pthread_mutex_unlock(&thread_private->thread_data_lock);
> > > > > +
> > > > > +  return thread_private;
> > > > >  }
> > > > >
> > > > > -static void thread_data_destructor(void *data) {
> > > > > -  cl_thread_spec_data* thread_spec_data = (cl_thread_spec_data
> > > > > *)data;
> > > > > +cl_gpgpu cl_get_thread_gpgpu(cl_command_queue queue) {
> > > > > +  thread_spec_data* spec = __create_thread_spec_data(queue, 1);
> > > > >
> > > > > -  if (thread_batch_buf) {
> > > > > -    cl_gpgpu_unref_batch_buf(thread_batch_buf);
> > > > > -    thread_batch_buf = NULL;
> > > > > +  if (!spec->thread_magic && spec->thread_magic != thread_magic) {
> > > > > +    //We may get the slot from last thread. So free the resource.
> > > > > +    spec->valid = 0;
> > > > >    }
> > > > >
> > > > > -  if (thread_spec_data->valid)
> > > > > -    cl_gpgpu_delete(thread_spec_data->gpgpu);
> > > > > -  cl_free(thread_spec_data);
> > > > > +  if (!spec->valid) {
> > > > > +    if (spec->thread_batch_buf) {
> > > > > +      cl_gpgpu_unref_batch_buf(spec->thread_batch_buf);
> > > > > +      spec->thread_batch_buf = NULL;
> > > > > +    }
> > > > > +    if (spec->gpgpu) {
> > > > > +      cl_gpgpu_delete(spec->gpgpu);
> > > > > +      spec->gpgpu = NULL;
> > > > > +    }
> > > > > +    TRY_ALLOC_NO_ERR(spec->gpgpu,
> > cl_gpgpu_new(queue->ctx->drv));
> > > > > +    spec->valid = 1;
> > > > > +  }
> > > > > +
> > > > > + error:
> > > > > +  return spec->gpgpu;
> > > > >  }
> > > > >
> > > > > -/* Create the thread specific data. */
> > > > > -void* cl_thread_data_create(void)
> > > > > +void cl_set_thread_batch_buf(cl_command_queue queue, void* buf)
> > > > >  {
> > > > > -  int rc = 0;
> > > > > +  thread_spec_data* spec = __create_thread_spec_data(queue, 1);
> > > > >
> > > > > -  pthread_key_t *thread_specific_key = CALLOC(pthread_key_t);
> > > > > -  if (thread_specific_key == NULL)
> > > > > -    return NULL;
> > > > > +  assert(spec && spec->thread_magic == thread_magic);
> > > > > +
> > > > > +  if (spec->thread_batch_buf) {
> > > > > +    cl_gpgpu_unref_batch_buf(spec->thread_batch_buf);
> > > > > +  }
> > > > > +  spec->thread_batch_buf = buf;
> > > > > +}
> > > > >
> > > > > -  rc = pthread_key_create(thread_specific_key,
> > > > > thread_data_destructor);
> > > > > +void* cl_get_thread_batch_buf(cl_command_queue queue) {
> > > > > +  thread_spec_data* spec = __create_thread_spec_data(queue, 1);
> > > > >
> > > > > -  if (rc != 0)
> > > > > -    return NULL;
> > > > > +  assert(spec && spec->thread_magic == thread_magic);
> > > > > +
> > > > > +  return spec->thread_batch_buf;
> > > > > +}
> > > > > +
> > > > > +void cl_invalid_thread_gpgpu(cl_command_queue queue) {
> > > > > +  queue_thread_private *thread_private = ((queue_thread_private
> > > > > +*)(queue->thread_data));
> > > > > +  thread_spec_data* spec = NULL;
> > > > >
> > > > > -  return thread_specific_key;
> > > > > +  pthread_mutex_lock(&thread_private->thread_data_lock);
> > > > > +  spec = thread_private->threads_data[thread_id];
> > > > > +  assert(spec);
> > > > > +  pthread_mutex_unlock(&thread_private->thread_data_lock);
> > > > > +
> > > > > +  if (!spec->valid) {
> > > > > +    return;
> > > > > +  }
> > > > > +
> > > > > +  assert(spec->gpgpu);
> > > > > +  cl_gpgpu_delete(spec->gpgpu);
> > > > > +  spec->gpgpu = NULL;
> > > > > +  spec->valid = 0;
> > > > >  }
> > > > >
> > > > >  /* The destructor for clean the thread specific data. */ -void
> > > > > cl_thread_data_destroy(void * data)
> > > > > +void cl_thread_data_destroy(cl_command_queue queue)
> > > > >  {
> > > > > -  pthread_key_t *thread_specific_key = (pthread_key_t *)data;
> > > > > -
> > > > > -  /* First release self spec data. */
> > > > > -  cl_thread_spec_data* thread_spec_data =
> > > > > -         pthread_getspecific(*thread_specific_key);
> > > > > -  if (thread_spec_data && thread_spec_data->valid) {
> > > > > -    cl_gpgpu_delete(thread_spec_data->gpgpu);
> > > > > -    if (thread_spec_data)
> > > > > -      cl_free(thread_spec_data);
> > > > > +  int i = 0;
> > > > > +  queue_thread_private *thread_private = ((queue_thread_private
> > > > > + *)(queue->thread_data));  int threads_data_num;
> > > > > +  thread_spec_data** threads_data;
> > > > > +
> > > > > +  pthread_mutex_lock(&thread_private->thread_data_lock);
> > > > > +  assert(thread_private->threads_data_num == thread_array_num);
> > > > > + threads_data_num = thread_private->threads_data_num;
> > > threads_data
> > > > > + = thread_private->threads_data;
> > > > > + thread_private->threads_data_num = 0;
> > > > > + thread_private->threads_data = NULL;
> > > > > + pthread_mutex_unlock(&thread_private->thread_data_lock);
> > > > > +  cl_free(thread_private);
> > > > > +  queue->thread_data = NULL;
> > > > > +
> > > > > +  for (i = 0; i < threads_data_num; i++) {
> > > > > +    if (threads_data[i] != NULL &&
> threads_data[i]->thread_batch_buf)
> > {
> > > > > +      cl_gpgpu_unref_batch_buf(threads_data[i]->thread_batch_buf);
> > > > > +      threads_data[i]->thread_batch_buf = NULL;
> > > > > +    }
> > > > > +
> > > > > +    if (threads_data[i] != NULL && threads_data[i]->valid) {
> > > > > +      cl_gpgpu_delete(threads_data[i]->gpgpu);
> > > > > +      threads_data[i]->gpgpu = NULL;
> > > > > +      threads_data[i]->valid = 0;
> > > > > +    }
> > > > > +    cl_free(threads_data[i]);
> > > > >    }
> > > > >
> > > > > -  pthread_key_delete(*thread_specific_key);
> > > > > -  cl_free(thread_specific_key);
> > > > > +  cl_free(threads_data);
> > > > >  }
> > > > > diff --git a/src/cl_thread.h b/src/cl_thread.h index
> > > > > c8ab63c..bf855a2 100644
> > > > > --- a/src/cl_thread.h
> > > > > +++ b/src/cl_thread.h
> > > > > @@ -27,7 +27,7 @@
> > > > >  void* cl_thread_data_create(void);
> > > > >
> > > > >  /* The destructor for clean the thread specific data. */ -void
> > > > > cl_thread_data_destroy(void * data);
> > > > > +void cl_thread_data_destroy(cl_command_queue queue);
> > > > >
> > > > >  /* Used to get the gpgpu struct of each thread. */  cl_gpgpu
> > > > > cl_get_thread_gpgpu(cl_command_queue queue); @@ -36,9 +36,9 @@
> > > > > cl_gpgpu cl_get_thread_gpgpu(cl_command_queue queue);  void
> > > > > cl_invalid_thread_gpgpu(cl_command_queue queue);
> > > > >
> > > > >  /* Used to set the batch buffer of each thread. */ -void
> > > > > cl_set_thread_batch_buf(void* buf);
> > > > > +void cl_set_thread_batch_buf(cl_command_queue queue, void* buf);
> > > > >
> > > > >  /* Used to get the batch buffer of each thread. */
> > > > > -void* cl_get_thread_batch_buf(void);
> > > > > +void* cl_get_thread_batch_buf(cl_command_queue queue);
> > > > >
> > > > >  #endif /* __CL_THREAD_H__ */
> > > > > --
> > > > > 1.8.3.2
> > > > >
> > > > > _______________________________________________
> > > > > Beignet mailing list
> > > > > Beignet at lists.freedesktop.org
> > > > > http://lists.freedesktop.org/mailman/listinfo/beignet
> > > > _______________________________________________
> > > > Beignet mailing list
> > > > Beignet at lists.freedesktop.org
> > > > http://lists.freedesktop.org/mailman/listinfo/beignet
> > >
> > >
> > >
> > > _______________________________________________
> > > Beignet mailing list
> > > Beignet at lists.freedesktop.org
> > > http://lists.freedesktop.org/mailman/listinfo/beignet
> 
> _______________________________________________
> Beignet mailing list
> Beignet at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet