[Pm-utils] Stale lock file causing abort

Fri Nov 7 08:10:48 PST 2008

On Fri, 07 Nov 2008 06:14:14 -0600
Victor Lowther <victor.lowther at gmail.com> wrote:

> On Tue, 2008-11-04 at 21:55 -0600, Robby Workman wrote:
> > I've gotten a report of pm-utils suddenly failing to work after
> > working successfully in the past, and through use of PM_DEBUG, I've
> > traced it to a stale /var/run/pm-utils/locks/pm-suspend.lock.
> > Looking at /usr/lib/pm-utils/functions, I don't see anything
> > obviously wrong with locking functions, and of course it wouldn't
> > be prudent to trap abnormal exits with lock removal, as that would
> > paper over what's likely a real problem somewhere.  
> 
> Actually, pm-action removes locks via the shell trap mechanism -- the
> pm-suspend lock should be removed no matter how the script exits.  The
> only exceptions are if pm-action is kill -9'ed, or if the system is
> restarted instead of resuming.  We don't do anything about the kill -9
> case, and in the reboot case we rely on the FHS, which says that
> the distro should clean out /var/run on reboot
> (http://www.pathname.com/fhs/pub/fhs-2.3.html#VARRUNRUNTIMEVARIABLEDATA).
> Does Slack do that?

Not completely, no.  We have several things that use subdirectories
of /var/run, and at least one of them doesn't create its subdirectory
on its own if it doesn't already exist, so it doesn't start (this is
HAL, btw).  I've been meaning to see about getting that addressed
upstream (and yeah, I know we could work around it in the init script),
but I keep forgetting  :/    I know some distributions put /var/run on
a tmpfs, which is probably a decent approach, but I don't think we'll
go that route.  Anyway, I'm looking into some better cruft cleanup in
our rc.S (runlevel 1) script.
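
Something along these lines is what I have in mind for the cleanup.  It's
only a rough sketch and not what's in our rc.S today; the function name
clean_run_dir is made up for illustration, and it assumes nothing
under /var/run needs to survive a reboot.  It removes leftover files but
keeps the directories, so daemons that don't create their own
subdirectory (HAL, as mentioned) still find it there:

```shell
#!/bin/sh
# Hypothetical boot-time cleanup helper; not taken from any real rc.S.
# Removes stale files (pid files, lock files, sockets) under a run
# directory while leaving the directory tree itself in place.
clean_run_dir() {
    run_dir=${1:-/var/run}

    # Nothing to do if the directory is missing.
    [ -d "$run_dir" ] || return 0

    # -xdev     : do not cross into other mounted filesystems
    # ! -type d : match everything except directories
    find "$run_dir" -xdev ! -type d -exec rm -f {} +
}
```

It would be called once, early in boot and before any daemons start,
as "clean_run_dir /var/run".  Anything that has to exist afterwards
(utmp, for example) would need to be recreated by the init script.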

> > Unfortunately, neither I nor the reporter have been able to
> > reproduce it since that one instance, so I'm at a loss on how to
> > further troubleshoot this.  Has anyone else had this happen, and if
> > so, did you figure out what was causing it?  
> 
> It used to happen to me all the time while writing the locking code.
> I haven't seen it happen since we released 1.1.0, though. :)

Interestingly enough, it *just* happened to me again today, and the
cause is indeed a failed resume.  I did a suspend to disk last night
before going to bed, and when I powered up this morning, I got a fresh
instance of the OS.  I have no idea why that happened, as s2disk has
always worked flawlessly here, but I think this is the first time I've
actually done it since I've been running 2.6.27.4.  No time to try
debugging right now though, so don't put any brain cycles into it -
I'll work on that later :-)
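
For anyone following along, here's roughly why a failed resume leaves
the lock behind.  This is a minimal sketch of trap-based lock removal,
not the actual pm-utils code; LOCKDIR, take_lock, drop_lock and
run_locked are names I made up for illustration.  The lock goes away
whenever the shell gets a chance to run its EXIT trap, but a fresh boot
(or a kill -9) never gives it that chance, so the next attempt finds
the old file and refuses to run:

```shell
#!/bin/sh
# Sketch only: these names are illustrative, not copied from pm-utils.
LOCKDIR=${LOCKDIR:-/tmp/pm-lock-demo.$$}

take_lock() {
    # $1 = lock file name
    mkdir -p "$LOCKDIR" || return 1
    # With noclobber (set -C) the redirect fails if the file already
    # exists, so only one caller can create the lock.
    ( set -C; echo "$$" > "$LOCKDIR/$1" ) 2>/dev/null
}

drop_lock() {
    rm -f "$LOCKDIR/$1"
}

run_locked() {
    # $1 = lock name, remaining arguments = command to run under the lock
    (
        name=$1; shift
        # An existing (possibly stale) lock makes us give up here.
        take_lock "$name" || exit 1
        # Remove the lock however this subshell exits...
        trap 'drop_lock "$name"' EXIT
        # ...including on catchable signals, by turning them into an exit.
        trap 'exit 1' HUP INT TERM
        "$@"
        exit $?
    )
}
```

So "run_locked pm-suspend.lock some-command" cleans up after itself on
both success and failure, but if the machine comes up fresh instead of
resuming, the file is still sitting in LOCKDIR and the take_lock step
fails until something removes it.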

-RW