[systemd-devel] Need help with a systemd/mdadm interaction.

NeilBrown neilb at suse.de
Mon Nov 18 22:33:43 PST 2013


On Thu, 14 Nov 2013 11:23:30 +0600 "Alexander E. Patrakov"
<patrakov at gmail.com> wrote:

> > NeilBrown wrote:
> > On Wed, 13 Nov 2013 22:11:27 +0600 "Alexander E. Patrakov"
> > <patrakov at gmail.com> wrote:
> >
> >> 2013/11/13 NeilBrown <neilb at suse.de>:
> >>> On Tue, 12 Nov 2013 19:01:49 +0400 Andrey Borzenkov <arvidjaar at gmail.com>
> >>> wrote:
> >>>
> >>>> Something like
> >>>>
> >>>> mdadm-last-resort at .timer
> >>>>
> >>>> [Timer]
> >>>> OnActiveSec=5s
> >>>>
> >>>> mdadm-last-resort at .service
> >>>>
> >>>> [Service]
> >>>> Type=oneshot
> >>>> ExecStart=/sbin/mdadm -IRs
> >>>>
> >>>> udev rule
> >>>>
> >>>> ... ENV{SYSTEMD_WANTS}+="mdadm-last-resort@$env{SOMETHING_UNIQUE}.timer"
> >>>>
> >>> Thanks.  This certainly looks interesting and might be part of a solution.
> >>> However it gets the timeout test backwards.
> >>>
> >>> I don't want to set the timeout when the array starts to appear.  I want to
> >>> set the time out when someone wants to use the array.
> >>> If no-one is waiting for the array device, then there is no point forcing it.
> >>>
> >>> That's why I want to plug into the timeout that systemd already has.
> >>>
> >>> Maybe that requirement isn't really necessary though.  I'll experiment with
> >>> your approach.
> >> It is useless to even try to plug into the existing systemd timeout,
> >> for a very simple reason: in setups where your RAID array is not at
> >> the top of the storage device hierarchy, systemd does not know that
> >> it wants your RAID array to appear.
> >>
> >> So the statement "If no-one is waiting for the array device, then
> >> there is no point forcing it" is false, because there is no way to
> >> know that no-one is waiting.
> >>
> > "useless" seems a bit harsh.  "not-optimal" may be true.
> >
> > If systemd was waiting for a device, then it is clear that something was
> > waiting for something.  In this case it might be justified to activate as
> > much as possible in the hope that the important things will get activated.
> > This is what "mdadm -IRs" does.  It activates all arrays that are still
> > inactive but have enough devices to become active (though degraded).  It
> > isn't selective.
> > If they are deep in some hierarchy, then udev will pull it all together and
> > the root device will appear.
> >
> > If systemd is not waiting for a device, then there is no justification for
> > prematurely starting degraded arrays.
> >
> >
> > Maybe I could get emergency.service to run "mdadm -IRs" and if that actually
> > started anything, then to somehow restart local-fs.target.  Might that be
> > possible?
> 
> If this is the way forward, then we should obviously think about a
> general mechanism that is useful not only for mdadm, but also for other
> layered storage implementations such as dm-raid, or maybe multi-device
> btrfs, and that still works when more than one of these technologies is
> used on top of another. This by necessity leads to multiple emergency
> missing-device handlers. And then the question immediately arises of
> the order in which the emergency handlers should be tried, because all
> that is known at the time of the emergency is that some device listed
> in /etc/fstab is missing. I suspect that the answer is "in arbitrary
> order" or even "in parallel", but then there is a chance that one run
> of all of them will not be enough.
> 
> This is not a criticism, just something to be fully thought out before 
> starting an implementation.
> 

Good points.
However dmraid doesn't have very good support for RAID arrays that actually
have redundancy.  It is fine for RAID0, but it only really supports RAID1 via
DDF and Intel's IMSM metadata, and mdadm has better support for those.  So
dmraid doesn't really figure much here.

And I suspect that btrfs would end up being a very different situation
because it vertically integrates the filesystem with the RAID layer.


My ideal solution would be for mdadm to assemble a degraded array as soon as
enough devices are available, but mark it soft-read-only.  When all the
expected disks arrive it would switch to read-write.

systemd would see this soft-read-only state and wait a bit longer for it to
become read-write.  If that didn't happen in time, it would force it to be
read-write.

Every stacked device would need to understand this soft-read-only state and
would set itself to soft-read-only if any component was soft-read-only, or
was missing.  It would notice changes in component devices and propagate them
up.

When the top device was forced to read-write, this would propagate all the
way down.

This would avoid the guesswork and "emergency catch-all" ugliness and yield a
fairly clean and predictable result.

If anything tried to write to a soft-read-only device, that would have the
same effect as forcing to read-write.


This way the whole storage stack would be plumbed at the earliest possible
moment, but would not generate any writes.  When systemd discovers the
device, if it notices that it is "soft-read-only", it could then wait a
timeout before forcing the device to read-write.  That would propagate down
the stack forcibly activating any degraded arrays.

That would require a fair bit of plumbing, but I think it might be worth it
as it provides a fairly general solution.
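
To make the systemd/udev side of that concrete, the consumer might look
roughly like the sketch below.  This is entirely hypothetical: there is no
"soft_ro" attribute today and "storage-force-rw" is an imaginary helper; it
is only meant to show how little new machinery udev and systemd would need.

# Hypothetical udev rule: hold back a device that is only soft-read-only
# and start a grace-period timer for it ("soft_ro" does not exist yet).
SUBSYSTEM=="block", ATTR{soft_ro}=="1", ENV{SYSTEMD_READY}="0", ENV{SYSTEMD_WANTS}+="storage-force-rw@$kernel.timer"

# storage-force-rw@.timer (hypothetical)
[Timer]
# Grace period for the remaining component devices to turn up.
OnActiveSec=30

# storage-force-rw@.service (hypothetical)
[Service]
Type=oneshot
# Force the top-level device read-write; the kernel would then propagate
# the transition down the stack, starting any still-degraded arrays.
ExecStart=/usr/sbin/storage-force-rw /dev/%i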

One problem is that if one half of a RAID1 had failed but is working again on
reboot, and it is noticed first, then the array could initially be assembled
soft-read-only with old data; when the newer device is found, the content
would suddenly change.  That could confuse filesystems, so some sort of "I've
changed content" signal would need to be plumbed through as well...

I think I'll go ahead with a timer-based approach like the one described by
Andrey, and think more about creating a "soft-read-only" state for storage
stacks that are as yet incomplete, but can be made complete.
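
For the record, the shape I have in mind is roughly the following (untested;
SOMETHING_UNIQUE is still a placeholder for a stable per-array identifier,
and the "..." stands for whatever match identifies a newly appeared array
that cannot yet be started safely):

# mdadm-last-resort@.timer (sketch)
[Timer]
# OnActiveSec gives a delay relative to when the timer unit is started;
# the exact grace period is to be tuned.
OnActiveSec=30

# mdadm-last-resort@.service (sketch)
[Service]
Type=oneshot
# -IRs: incrementally run any arrays that are still inactive but have
# enough devices to start degraded.  It is not selective.
ExecStart=/sbin/mdadm -IRs

# udev rule (sketch)
ACTION=="add|change", ..., ENV{SYSTEMD_WANTS}+="mdadm-last-resort@$env{SOMETHING_UNIQUE}.timer"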

Thanks,
NeilBrown