[lvm-team] Soliciting feature requests for development of an LVM library / API

Mon Dec 15 12:00:41 PST 2008

On Sat, 2008-12-13 at 15:23 +0000, Alasdair G Kergon wrote:
> On Sat, Dec 13, 2008 at 03:28:20PM +0100, Kay Sievers wrote:
> > On Sat, Dec 13, 2008 at 02:30, Dave Wysochanski <dwysocha at redhat.com> wrote:
> > > Note that by scanning a single PV we may not be able to answer
> > > definitively what VG it is in or what LVs it may contain.  So provided
> > > we go this route with putting into the udev database all needed
> > > information that today you get from pv/vg/lvdisplay commands, something
> > > else will have to put together individual PV information to get an
> > > accurate picture of the VG / LV info in the system.
>  
> When we discussed this with David in Boston, it became clear that the
> udev database as currently implemented was unable to hold the information
> LVM2 needs to store.
> 
> So we started going back to the idea that LVM2 would still need its
> own database (an extension of /etc/lvm/cache described on a bugzilla 
> somewhere IIRC).  Then I was wondering about integrating with upstart 
> to handle the rule-based 'triggers'.
> 
> Alternatively we need to find a way to extend the udev database.

I'm not sure we want to extend the udev database; in my view it's
supposed to be a small and efficient mechanism that allows one to
annotate directories in /sys with additional information that we collect
in user space.

As an aside, I think the udev database is more than adequate for this
thing if we just choose the right format we want to use to convey this
information (see below).

FWIW, in many ways, one can think of the udev database as an extension
of sysfs insofar as attributes in sysfs represent information / state
exported by the kernel driver while attributes in the udev database
represent information / state exported by user space programs /
daemons.

The useful thing about such a view is that the rest of user space can
treat a) sysfs attributes; and b) attributes in the udev database;
exactly the same way. In particular, this makes it reasonably easy for
the rest of the OS to handle when something in user space is moved into
the kernel and vice versa: instead of opening a sysfs file, you simply
query for the udev attribute (and for transitional periods, udev
rules can be used to easily mirror a sysfs value in a udev attribute).
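
To make that concrete, here is a minimal sketch in Python of a consumer
that asks for an attribute by name and does not care whether user space
(the udev database) or the kernel (sysfs) exports it. The function name,
the plain dict standing in for a device's udev properties, and the
property name EXAMPLE_LABEL are assumptions made for illustration; this
is not an existing udev API.

```python
import os
import tempfile


def read_attr(name, udev_props, sysfs_dir):
    """Return attribute `name` for a device, or None if nobody exports it.

    udev_props: dict standing in for the device's udev database entry.
    sysfs_dir:  the device's directory under /sys.
    """
    # Prefer the value exported by user space, if there is one.
    if name in udev_props:
        return udev_props[name]
    # Otherwise fall back to the kernel-exported sysfs file.
    try:
        with open(os.path.join(sysfs_dir, name)) as f:
            return f.read().strip()
    except OSError:
        return None


def demo():
    # Use a temporary directory as a fake sysfs device directory so the
    # sketch runs anywhere.
    with tempfile.TemporaryDirectory() as sysfs_dir:
        with open(os.path.join(sysfs_dir, "size"), "w") as f:
            f.write("1024\n")
        udev_props = {"EXAMPLE_LABEL": "root"}  # hypothetical property
        return (
            read_attr("size", udev_props, sysfs_dir),
            read_attr("EXAMPLE_LABEL", udev_props, sysfs_dir),
            read_attr("missing", udev_props, sysfs_dir),
        )
```

If an attribute later moves from user space into the kernel (or the
other way around), callers of read_attr() would not need to change.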

> > For various reasons we can not afford to open any "unknown" device
> > from these tools or the library, 
> 
> Indeed - that's what the 'trustcache' setting I added a while back was
> meant for.
> 
> Example walk through:
> /dev/sda appears
> udev notified
> - calls out to md 'is this yours?' - md says 'no'
> - calls out to lvm2 'is this yours?' 
>   - lvm2 sees that it is a PV
>       - stores the basic PV label info in a database somewhere
>     lvm2 sees that the PV contains VG metadata
>       - reads the VG metadata - it references 1 other PV UUID
>       - *stores* the vg name, vg uuid, plus PV UUIDs references (1 of which can't yet be resolved to a device) in a database somewhere
>     lvm2 responds "yes" ie claims the device.  udev stops asking other subsystems
> 
> /dev/sdb appears
> udev notified
> - calls out to md 'is this yours?' - md says 'no'
> - calls out to lvm2 'is this yours?'
>   - lvm2 sees that it is a PV
>     - stores the basic PV label info in the database
>   - lvm2 sees that the PV contains VG metadata (nothing to do - already in the database)
>   - lvm2 responds "yes" ie claims the device.  udev stops asking other subsystems
> A trigger (Upstart?) notices that all the PVs making up a VG are present and that there
> is a rule saying that if this particular VG is seen, all the LVs in it should be activated
> - vgchange -ay  runs on that VG
> 1 LV appears /dev/vg0/lv0
> udev notified
> - calls out to md 'is this yours?' - md says 'no'
> - calls out to lvm2 'is this yours?' lvm2 says 'no'
> - calls out to mount 'is this yours?'
>   - mount sees there's a (legacy) fstab entry for it 
>     - there's a filesystem on it so it mounts it as per rule
> 
> Exactly what 'owns' the database, rule engine and triggers makes little
> difference to me - but this is the sort of event-driven functionality I'm
> expecting to see when this project is complete.
> The bit I'm told the udev database can't handle is storing the VG information
> with indexes independent of the PVs.
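
For reference, the bookkeeping in the walk-through quoted above can be
sketched in a few lines of Python. This is only a toy in-memory model
under stated assumptions: the class name VGCache, its methods and the
UUID strings are hypothetical, and nothing here reflects how LVM2 or
udev actually store data. The point is that VG records are indexed by
VG UUID, independently of the PVs, so a VG can be recorded before all
of its PVs have been resolved to devices.

```python
class VGCache:
    """Toy stand-in for the 'database somewhere' in the walk-through."""

    def __init__(self):
        self.pv_to_dev = {}  # PV UUID -> device node, e.g. "/dev/sda"
        self.vgs = {}        # VG UUID -> {"name": str, "pvs": set of PV UUIDs}

    def is_complete(self, vg_uuid):
        """True if every PV referenced by the VG has been seen."""
        return all(pv in self.pv_to_dev for pv in self.vgs[vg_uuid]["pvs"])

    def pv_appeared(self, dev, pv_uuid, vg_meta=None):
        """Record a scanned PV; return names of VGs that just became complete."""
        was_complete = {u for u in self.vgs if self.is_complete(u)}
        self.pv_to_dev[pv_uuid] = dev
        # Store the VG metadata only the first time we see it.
        if vg_meta is not None and vg_meta["uuid"] not in self.vgs:
            self.vgs[vg_meta["uuid"]] = {
                "name": vg_meta["name"],
                "pvs": set(vg_meta["pvs"]),
            }
        return sorted(
            self.vgs[u]["name"]
            for u in self.vgs
            if u not in was_complete and self.is_complete(u)
        )


def demo():
    cache = VGCache()
    meta = {"uuid": "vg-uuid-0", "name": "vg0", "pvs": ["pv-a", "pv-b"]}
    first = cache.pv_appeared("/dev/sda", "pv-a", meta)   # one PV unresolved
    second = cache.pv_appeared("/dev/sdb", "pv-b", meta)  # VG now complete
    return first, second
```

A trigger would react to the non-empty return value of the second call,
for instance by running vgchange -ay on that VG as in the walk-through.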

I think you're conflating a lot of things here. I think a top-down
approach would be useful:

 1. First we need to agree what information we want in the udev database
    for device-mapper and LVM. This includes rigorously defining what
    attributes are put in the udev database and documenting it (ideally
    in a man page).

    In a way this will form a "contract", an ABI if you want, that the
    device-mapper and LVM tools will guarantee to provide.

    (It's implied here that we also need to agree upfront that it's not
     enough for device-mapper and LVM to just provide tools to do this;
     we *need* device-mapper and LVM to *ship* udev rules that actually
     populate the udev database with this information. Because without
     this, things like DeviceKit-disks and other projects are forced
     to add their own rules and then things get racy and work gets
     duplicated.)

 2. Then we can discuss what the best way to implement this is. This
    falls squarely into the device-mapper domain; i.e. you can do this
    whichever way you want. There are of course practical constraints,
    meaning that the udev rules you add shouldn't consume a lot of
    resources.

    (IOW, while it's interesting that dmsetup now takes new parameters
     that easily allow it to be used from a simple udev rule, it's in
     other ways not very interesting: it's just an implementation detail
     of how you populate the udev database, which is something that
     you, and not udev or DeviceKit-disks developers, should care
     about.)

 3. Finally we can discuss how this information can be used to implement
    policy by writing very simple udev rules that leverage the info in
    the udev database defined in 1.

    For example, one thing many people (desktop developers like me
    but also people working on initramfs/booting (jeremy, davej)
    and also anaconda) probably want to request is that device-mapper
    and LVM ship with udev rules that use the information defined in 1.
    above to implement a policy that automatically assembles LVs from
    PVs. If we solve 1. correctly, this *shouldn't* be more complicated
    than a simple one-liner udev rule.

    This is just one example. My point is that when we work on solving
    problem 1. above we need to have such things in mind (there are
    also other policies we want to implement from data in 1. that I
    haven't mentioned here.)
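
To illustrate why a well-defined contract in 1. keeps the policy in 3.
trivial, here is a small sketch in Python of the decision such a rule
would encode. The property names EXAMPLE_LVM_VG_NAME and
EXAMPLE_LVM_VG_COMPLETE are purely hypothetical placeholders for
whatever attributes end up being agreed on; the only real command
referenced is vgchange -ay.

```python
def activation_command(props):
    """Return the argv to activate a VG, or None if nothing should run.

    props: dict standing in for a device's udev database entry.
    Both keys below are hypothetical; the real names are what topic 1.
    is meant to define.
    """
    vg_name = props.get("EXAMPLE_LVM_VG_NAME")
    if vg_name and props.get("EXAMPLE_LVM_VG_COMPLETE") == "1":
        return ["vgchange", "-ay", vg_name]
    return None


def demo():
    complete = {"EXAMPLE_LVM_VG_NAME": "vg0", "EXAMPLE_LVM_VG_COMPLETE": "1"}
    partial = {"EXAMPLE_LVM_VG_NAME": "vg0", "EXAMPLE_LVM_VG_COMPLETE": "0"}
    not_lvm = {}
    return (
        activation_command(complete),
        activation_command(partial),
        activation_command(not_lvm),
    )
```

The policy only reads attributes; how they got into the database is the
implementation detail from 2.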

OK, so this is an attempt to split the bigger problem into smaller parts
(divide and conquer). How about we start talking about topic 1. first
and then we can talk about implementation details and policy once we've
figured out what we want in the udev database?

For 1. I expect that we want to attach udev attributes to

 a. Things we detect as LVM Physical Volumes (e.g. /dev/sda1)

 b. Mapped devices (e.g. /dev/dm-0)
    - At the end of this mail
      http://lists.freedesktop.org/archives/devkit-devel/2008-December/000070.html
      there's one proposal for this, modulo the name space.

 c. Anything else?

Alasdair, since you are the domain expert here, how about you come up
with a proposal on concrete udev attributes (keeping feature requests
like those in 3. above and others in mind) that device-mapper / LVM
could put in the udev database for a) and b)?

    David

ps.: As an aside, I *think* that once we agree on 1. (i.e. the data
that device-mapper / LVM will provide in the udev database) it should be
straightforward for you to implement this in a naive way where e.g. the
device-mapper and LVM tools open all block devices to provide the
needed information.

Then you can work on optimizing this (e.g. your "trust cache" setting)
while early adopters (like DeviceKit-disks, mkinitrd, anaconda etc.) can
start using it.