[systemd-devel] RFC: one more time: SCSI device identification

Wed Apr 28 06:30:28 UTC 2021

On Tue, 2021-04-27 at 16:41 -0400, Ewan D. Milne wrote:
> On Tue, 2021-04-27 at 20:33 +0000, Martin Wilck wrote:
> > On Tue, 2021-04-27 at 16:14 -0400, Ewan D. Milne wrote:
> > > 
> > > There's no way to do that, in principle.  Because there could be
> > > other I/Os in flight.  You might (somehow) avoid retrying an I/O
> > > that got a UA until you figured out if something changed, but other
> > > I/Os can already have been sent to the target, or issued before you
> > > get to look at the status.
> > 
> > Right. But in practice, a WWID change will hardly happen under full
> > IO
> > load. The storage side will probably have to block IO while this
> > happens, at least for a short time period. So blocking and quiescing
> > the queue upon an UA might still work, most of the time. Even if we
> > were too late already, the sooner we stop the queue, the better.
> > 
> > The current algorithm in multipath-tools needs to detect a path going
> > down and being reinstated. The time interval during which a WWID
> > change
> > will go unnoticed is one or more path checker intervals, typically on
> > the order of 5-30 seconds. If we could decrease this interval to a
> > sub-
> > second or even millisecond range by blocking the queue in the kernel
> > quickly, we'd have made a big step forward.
> 
> Yes, and in many situations this may help.  But in the general case
> we can't protect against a storage array misconfiguration,
> where something like this can happen.  So I worry about people
> believing the host software will protect them against a mistake,
> when we can't really do that.

I agree. I expressed a similar notion in the following thread about
multipathd's WWID change detection capabilities in the face of really
bad mistakes on the administrator's (or storage array's, FTM)  part:
https://listman.redhat.com/archives/dm-devel/2021-February/msg00248.html
But others stressed that nonetheless we should try our best to
avoid customer data corruption (which I agree with, too), and thus we
settled on the current algorithm, which suited the needs at least of
the affected user(s) in that specific case.

Personally I think that the current "5-30s" time period for WWID change
detection in multipathd is unsafe both theoretically and practially,
and may lure users into a false feeling of safety. Therefore I'd
strongly welcome a kernel-side solution that might still not be safe
theoretically, but cover most practical problem scenarios much better
than we currently do.

Regards
Martin

-- 
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer