select() timeouts on large installations

Mon Aug 22 19:17:30 PDT 2005

On Mon, 2005-08-22 at 18:08 +0200, Cornelia Huck wrote:
> Hi list,

Hi,

> 
> I'm running into some problems when trying to start the HAL daemon on
> large installations (like a S/390 LPAR with several thousands of
> devices). Device detection may take quite some time, more than the 25
> seconds specified as a timeout value for select() in hald/hald.c, and as
> a result, the daemon will abort.

Ugh.

> 
> I've tried specifying a higher timeout, which works for me, but seems a
> bit dumb (who guarantees us that there is no installation which needs
> even more time?). 

I guess setting the timeout to infinity (i.e. letting select() block
with no timeout at all) would help in the general case. The daemon
aborting is bad bad bad...
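
A minimal sketch of what "infinity" means for select(): passing NULL as
the timeout argument makes the call block until a descriptor is ready.
The helper name wait_readable() and its timeout_secs parameter are made
up for illustration. This is not the code in hald/hald.c, and it does
not retry when a signal interrupts the call.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Hypothetical helper: wait until fd becomes readable.
 *   timeout_secs <  0 : block indefinitely (NULL timeout for select())
 *   timeout_secs >= 0 : give up after that many seconds
 * Returns 1 if fd is readable, 0 on timeout, -1 on error (errno is set
 * by select(), e.g. EINTR when a signal arrives).
 */
int wait_readable(int fd, long timeout_secs)
{
    fd_set readfds;
    struct timeval tv;
    struct timeval *tvp = NULL;   /* NULL means "no timeout" */

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    if (timeout_secs >= 0) {
        tv.tv_sec = timeout_secs;
        tv.tv_usec = 0;
        tvp = &tv;
    }

    /* First argument is the highest descriptor number plus one. */
    return select(fd + 1, &readfds, NULL, NULL, tvp);
}
```

With this shape, wait_readable(fd, 25) gives up after 25 seconds, while
wait_readable(fd, -1) waits for as long as detection takes. The
trade-off is that a caller that never gets an answer hangs instead of
aborting.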

> strace doesn't show any place where the daemon spends
> too much time waiting, it seems to be busy all the time gathering
> information. Any idea on how device detection may be speeded up a bit?

I'd turn on verbose logging and redirect stderr to a file to see what is
taking so long. Assuming we're CPU bound, profiling may help too, e.g.
to see which functions we are spending the time in.

Btw, no one has tried optimizing this before, so there should be some
low-hanging fruit. For instance, looking up a property is O(n) and it
could *easily* be made O(1) using e.g. GHashTable...
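
To make the O(n) versus O(1) point concrete, here is a rough sketch in
plain C. It is not HAL's actual property code: struct prop, prop_table
and all function names are invented for illustration, and a small
hand-rolled chained hash table stands in for GHashTable so the example
has no GLib dependency. Lookups are O(1) only on average, and only while
the fixed bucket count stays reasonable for the number of properties
(GHashTable resizes itself, this sketch does not).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64   /* fixed size, chosen arbitrarily for the sketch */

struct prop { const char *key; const char *value; };

/* O(n): compare against every property until the key matches. */
const char *lookup_linear(const struct prop *props, size_t n, const char *key)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(props[i].key, key) == 0)
            return props[i].value;
    return NULL;
}

struct node { struct prop p; struct node *next; };
struct prop_table { struct node *buckets[NBUCKETS]; };

/* djb2 string hash. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Insert or overwrite a property. The strings are not copied.
 * Returns 0 on success, -1 if out of memory. */
int table_set(struct prop_table *t, const char *key, const char *value)
{
    struct node **slot = &t->buckets[hash_str(key) % NBUCKETS];
    for (struct node *n = *slot; n != NULL; n = n->next) {
        if (strcmp(n->p.key, key) == 0) {
            n->p.value = value;
            return 0;
        }
    }
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->p.key = key;
    n->p.value = value;
    n->next = *slot;
    *slot = n;
    return 0;
}

/* Average O(1): hash the key, then scan only that one bucket. */
const char *table_get(const struct prop_table *t, const char *key)
{
    const struct node *n = t->buckets[hash_str(key) % NBUCKETS];
    for (; n != NULL; n = n->next)
        if (strcmp(n->p.key, key) == 0)
            return n->p.value;
    return NULL;
}

/* Free all nodes and leave the table empty. */
void table_clear(struct prop_table *t)
{
    for (size_t i = 0; i < NBUCKETS; i++) {
        struct node *n = t->buckets[i];
        while (n != NULL) {
            struct node *next = n->next;
            free(n);
            n = next;
        }
        t->buckets[i] = NULL;
    }
}
```

A table is started with "struct prop_table t = {0};", filled with
table_set() and queried with table_get(). With GLib, the equivalent
would be g_hash_table_new(g_str_hash, g_str_equal) together with
g_hash_table_insert() and g_hash_table_lookup().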

For now, though, I've applied your patch (s/25/250/), which raises the
timeout from 25 to 250 seconds.

Cheers,
David