parallelizing crashtest runs (was: minutes of ESC call ...)

Fri Oct 31 06:38:15 PDT 2014

Hey,

On Fri, Oct 31, 2014 at 2:23 PM, Christian Lohmaier
<lohmaier at googlemail.com> wrote:
> Hi *,
>
> On Thu, Oct 30, 2014 at 5:39 PM, Michael Meeks
> <michael.meeks at collabora.com> wrote:
>>
>> * Crashtest futures / automated test scripts (Markus)
>>     + call on Tuesday; new testing hardware.
>>     + result - get a Manitu server & leave room in the budget for
>>       ondemand Amazon instances (with spot pricing) if there is
>>       special need at some point.
>> [...]
>
> When I played with the crashtest setup I noticed some limitations in
> its current layout that prevent just using lots of cores/high
> parallelism to get faster results.
>
> The problem is that it is parallelized per directory, but the amount
> of files in a directory is not evenly distributed at all. So when the
> script decides to start odt tests last, the whole set of odt files
> will only be tested in one thread, leaving the other CPU-cores idling
> around with nothing to do.
>
> I did add a sorting statement to the script, so it will start with the
> directories with most files[1], but even with that you run into the
> problem that towards the end of the testrun not all cores will be
> used. As the AMD Opterons in the Manitu ones are less capable per-cpu
> this will set a limit to how much you can accelerate the run by just
> assigning more cores to it.
>
> Didn't look into the overall setup to know whether just segmenting the
> large directories into smaller ones is easy to do or not (i.e. instead
> of having one odt dir with 10500+ files, have 20 with ~500 each).
>
> ciao
> Christian
>
> [1] added the sorted statement that uses the number of files in the
> directory as the key to sort by:
>
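> import os
>
> def get_numfiles(directory):
>     # Number of entries in the directory, used as the sort key.
>     return len(os.listdir(directory))
>
> def get_directories():
>     d='.'
>     directories = [o for o in os.listdir(d) if os.path.isdir(os.path.join(d,o))]
>     # Largest directories first, so the long-running ones start early.
>     return sorted(directories, key=get_numfiles, reverse=True)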

This is a known limitation, but there are two solutions to the problem:

The quick and ugly one is to partition the directories into 100-file
directories. I have a script for that, as I have done exactly that for
the memcheck run on the 70-core Largo server. The script itself is a
quick and ugly implementation.
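
A minimal sketch of what such a partitioning step could look like. This
is not the actual script; the function name split_directory, the
"<dir>_NNN" naming scheme and the chunk_size parameter are made up for
illustration, and it assumes the target directories do not exist yet.

```python
import os
import shutil

def split_directory(directory, chunk_size=100):
    """Move the files in `directory` into numbered sibling directories
    holding at most `chunk_size` files each (hypothetical helper)."""
    files = sorted(
        f for f in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, f))
    )
    created = []
    for start in range(0, len(files), chunk_size):
        # e.g. odt -> odt_000, odt_001, ... (naming scheme is an assumption)
        target = "%s_%03d" % (directory.rstrip(os.sep), start // chunk_size)
        os.makedirs(target)  # raises if the chunk directory already exists
        for name in files[start:start + chunk_size]:
            shutil.move(os.path.join(directory, name),
                        os.path.join(target, name))
        created.append(target)
    return created
```

With directories of roughly equal size, the existing per-directory
parallelization has much less of a tail at the end of the run.
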
The clean and much better solution is to move away from directory-based
invocation and partition by files on the fly. I have a
proof of concept somewhere on my machine and will push a working
version in the next few days. This would even save us about half a day
on our current setup, as ods and odt are normally the last two running,
for about half a day longer than the rest of the script.
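
To illustrate the idea of partitioning by files instead of directories,
here is a rough sketch. It is not the proof of concept mentioned above.
The names collect_files, check_file and run_per_file are hypothetical,
and check_file is only a placeholder for the real per-file test. The
sketch assumes each test mostly waits on an external process, so a
thread pool is enough to keep the cores busy.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def collect_files(root="."):
    """Return every file below `root`, ignoring directory boundaries."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)

def check_file(path):
    """Placeholder for the real test of one document (assumption:
    the real version would launch the office process on `path`)."""
    return os.path.getsize(path)

def run_per_file(root=".", workers=4, task=check_file):
    """Run `task` on every file; each worker takes the next file as
    soon as it is free, so no worker idles while files remain."""
    files = collect_files(root)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(task, files))
    return dict(zip(files, results))
```

Because the work queue is one flat list of files, a directory with
10500+ files is spread over all workers instead of being tied to one.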

With both solutions this scales perfectly. We have already tested it
on the Largo server, where I was able to keep a load of 70 for exactly
a week (with memcheck, but that only affects the overall runtime).

Regards,
Markus