parallelizing crashtest runs (was: minutes of ESC call ...)

Sun Nov 2 16:28:13 PST 2014

Hey,

On Fri, Oct 31, 2014 at 2:45 PM, Christian Lohmaier
<lohmaier at googlemail.com> wrote:
> Hi Markus, *,
>
> On Fri, Oct 31, 2014 at 2:38 PM, Markus Mohrhard
> <markus.mohrhard at googlemail.com> wrote:
>>
>> The quick and ugly one is to partition the directories into 100 file
>> directories. I have a script for that as I have done exactly that for
>> the memcheck run on the 70 core Largo server. It is a quick and ugly
>> implementation.
>> The clean and much better solution is to move away from directory
>> based invocation and partition by files on the fly.
>
> Yeah, I also thought of keeping the per-directory/filetype processing,
> but instead run multiple dirs at once, rather divide the set of files
> of a given dir into the <number of workers> chunks.
>
>> I have a
>> proof-of-concept somewhere on my machine and will push a working
>> version during the next days.
>
> nice :-)
>

So a working version is currently running on the VM. The version in
the repo will be updated as soon as the script finishes without
problems. It now parallelizes nearly perfectly, as it divides the work
into 100-file chunks and works through them. After the last update of
the test files this gives us 641 jobs, which are put into a queue, and
we process as many jobs in parallel as we want (5 on the VM at the
moment).
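
Roughly, the chunking and queueing looks like the following. This is
only a minimal sketch and not the script from the repo: the names
chunk_files, run_chunk and run_all are made up for illustration, the
thread pool is my assumption about how the queue could be done, and
run_chunk is a placeholder for actually handing the files to
LibreOffice.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 100  # files per job, as described above
WORKERS = 5       # number of parallel jobs (the VM currently uses 5)


def chunk_files(files, size=CHUNK_SIZE):
    """Split a list of files into consecutive chunks of at most `size` files."""
    return [files[i:i + size] for i in range(0, len(files), size)]


def run_chunk(chunk):
    """Hypothetical placeholder for one job.

    A real job would pass the chunk to LibreOffice; here we only
    report how many files the job received.
    """
    return len(chunk)


def run_all(files, workers=WORKERS):
    """Queue every chunk and process at most `workers` chunks at a time."""
    jobs = chunk_files(files)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the results in the same order as the jobs
        return list(pool.map(run_chunk, jobs))
```

With 250 files this gives three jobs (100, 100 and 50 files), and
changing the number of workers does not change how the work is split.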

Additionally, the updated version of the script no longer hard-codes a
mapping from file extension to component and instead queries
LibreOffice to see which component opened the file. That lets us
remove quite a few mappings and results in all file types being
imported. The old version only imported file types that were
registered in the mapping.
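
To illustrate the idea of asking the loaded document instead of
looking at the extension, here is a small sketch. It assumes the
check is done with supportsService() on the document, which is how
UNO objects report their services; whether the script does it exactly
this way is an assumption. The short component names, component_of
and the FakeDocument stand-in are made up so the example runs without
an office instance.

```python
# UNO service names for the main document types. The short component
# names on the right are illustrative, not taken from the script.
SERVICE_TO_COMPONENT = [
    ("com.sun.star.sheet.SpreadsheetDocument", "calc"),
    ("com.sun.star.text.TextDocument", "writer"),
    ("com.sun.star.presentation.PresentationDocument", "impress"),
    ("com.sun.star.drawing.DrawingDocument", "draw"),
]


def component_of(doc):
    """Return the component name for a loaded document, or "unknown"."""
    for service, component in SERVICE_TO_COMPONENT:
        if doc.supportsService(service):
            return component
    return "unknown"


class FakeDocument:
    """Hypothetical stand-in for a loaded document, for illustration only."""

    def __init__(self, services):
        self._services = set(services)

    def supportsService(self, name):
        return name in self._services
```

The point is that the extension never enters the picture: whatever
LibreOffice managed to open is classified by the document it
produced.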

The new script should scale nearly perfectly. There are still a few
enhancements on my list, so if anyone is interested in Python tasks,
please talk to me.

Regards,
Markus