Mail archive

Re: [alpine-devel] udev replacement on Alpine Linux

From: Jude Nelson
Date: Thu, 14 Jan 2016 00:55:57 -0500

Hi Laurent,

On Wed, Jan 13, 2016 at 7:33 AM, Laurent Bercot wrote:

> On 13/01/2016 04:47, Jude Nelson wrote:
>> I haven't tried this myself, but it should be doable. Vdev's
>> event-propagation mechanism is a small program that constructs a
>> uevent string from environment variables passed to it by vdev and
>> writes the string to the appropriate place. The vdev daemon isn't
>> aware of its existence; it simply executes it like it would for any
>> other matching device-event action. Another device manager could
>> supply the same program with the right environment variables and use
>> it for the same purposes.
> Indeed. My question then becomes: what are the differences between
> the string passed by the kernel (which is more or less a list of
> environment variables, too) and the string constructed by vdev ?
> In other words, is vdev itself more than a trivial netlink listener,
> and if yes, what does it do ? (I'll just take a pointer to the
> documentation if that question is answered somewhere.)
> For now I'll take a wild guess and say that vdev analyzes the
> MODALIAS or something, according to a conf file, in order to know
> the correct fan-out to perform and write the event to the correct
> subsystems. Am I close ?

(I should really sit down and write documentation sometime :)

I think you're close. The gist of it is that vdev needs to supply a lot
more information than the kernel gives it. In particular, its helper
programs go on to query the properties and status of each device (this
often requires root privileges, i.e. via privileged ioctl()s), and vdev
gathers the information into a (much larger) event packet and stores it in
a directory tree under /dev for subsequent query by less-privileged
programs. It doesn't rely on the MODALIAS per se; instead it matches
fields of the kernel's uevent packet (one of which is the MODALIAS) to the
right helper programs to run.
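
As an illustration only, the field-matching step described above might be sketched as follows. The kernel's uevent packet is an "action@devpath" header followed by NUL-separated KEY=VALUE fields; the rule table and the helper names below are hypothetical and are not taken from vdev's actual configuration format.

```python
def parse_uevent(packet: bytes) -> dict:
    """Parse a kernel-style uevent packet into a dict of its KEY=VALUE fields."""
    fields = {}
    for part in packet.split(b"\0"):
        # The leading "action@devpath" header has no '=' and is skipped.
        if b"=" in part:
            key, _, value = part.partition(b"=")
            fields[key.decode()] = value.decode()
    return fields


# Hypothetical rules: (fields that must all match, helper to run).
RULES = [
    ({"SUBSYSTEM": "block", "DEVTYPE": "disk"}, "stat_disk_helper"),
    ({"SUBSYSTEM": "input"}, "stat_input_helper"),
]


def match_helpers(fields: dict, rules=RULES) -> list:
    """Return the helpers whose conditions are all satisfied by the event."""
    return [
        helper
        for conditions, helper in rules
        if all(fields.get(k) == v for k, v in conditions.items())
    ]


# A made-up packet in the kernel's wire format, for demonstration.
SAMPLE = (
    b"add@/devices/example/block/sda\0"
    b"ACTION=add\0DEVPATH=/devices/example/block/sda\0"
    b"SUBSYSTEM=block\0DEVTYPE=disk\0"
)

if __name__ == "__main__":
    print(match_helpers(parse_uevent(SAMPLE)))
```

A real device manager would then execute each matched helper with the event fields in its environment and merge whatever properties the helper reports into the larger event packet.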

Here's an example of what vdev gathers for my laptop's SATA disk:

$ cat /dev/metadata/dev/sda/properties

Anything that starts with "VDEV_ATA_", as well as "VDEV_BUS",
"VDEV_SERIAL_*", "VDEV_TYPE", and "VDEV_REVISION" had to be extracted via
an ioctl, by exploring files in sysfs, or by querying a hardware database.
The kernel only supplied a few of these fields.

>> Tmpfs and devtmpfs are designed for holding ephemeral state already,
>> so I'm not sure why the fact that they expose data as regular files
>> is a concern?
> Two different meanings of "ephemeral".
> tmpfs and devtmpfs are supposed to retain their data until the
> end of the system's lifetime. An event is much more ephemeral
> than that: it's supposed to be consumed instantly - like the
> event from the kernel is consumed instantly by the netlink listener.
> Files, even in a tmpfs, remain alive in the absence of a live
> process to hold them; but events have no meaning if no process needs
> them, which is the reason for the "event leaking" problem.
> Ideally, you need a file type with basically the same lifetime
> as a process.

> Holding event data in a file is perfectly valid as long as you have
> a mechanism to reclaim the file as soon as the last reference to it
> dies.

Funny you mention this--I also created runfs to do exactly this. In particular, I
use it for PID files. Also, eventfs was actually derived from runfs, but
specialized more to make it more suitable for managing event-queues.

>> I couldn't think of a simpler way that was also as robust. Unless
>> I'm misunderstanding something, wrapping an arbitrary program to
>> clean up the files it created would, in the extreme, require coming
>> up with a way to do so on SIGKILL. I'd love to know if there is a
>> simple way to do this, though.
> That's where supervisors come into play: the parent of a process
> always knows when it dies, even on SIGKILL. Supervised daemons can
> have a cleaner script in place.
> For the general case, it shouldn't be hard to have a wrapper that
> forks an arbitrary program and cleans up /dev/metadata/whatever/*$childpid*
> when it dies. The price to pay is an additional process, but that
> additional process would be very small.
> You can still have a polling "catch-all cleaner" to collect dead events
> in case the supervisor/wrapper also died, but since that occurrence will
> be rare, the polling period can be pretty long so it's not a problem.

Agreed. I would be happy to keep this approach in mind in the design of
libudev-compat. Eventfs isn't a hard requirement and I don't want it to
be, since there's more than one way to deal with this problem.
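
For reference, the wrapper idea quoted above could look roughly like the sketch below. It assumes a hypothetical naming convention in which the wrapped program stores its per-process state in files whose names end in ".<pid>"; neither that convention nor the function name comes from vdev or libudev-compat.

```python
import glob
import os
import subprocess


def run_and_clean(argv, state_dir):
    """Run argv as a child process, then remove the state files it left behind.

    The parent always learns of the child's death, even on SIGKILL,
    because wait() returns however the child terminated.
    """
    child = subprocess.Popen(argv)
    status = child.wait()
    # Assumed convention: the child's state files are named "<something>.<pid>".
    pattern = os.path.join(glob.escape(state_dir), f"*.{child.pid}")
    for path in glob.glob(pattern):
        os.unlink(path)
    return status
```

The cost is one extra small process per wrapped program. A slow periodic sweep is still needed to cover the rare case where the wrapper itself is killed.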

>> I went with a specialized filesystem for two reasons, both of which
>> were to fulfill libudev's API contract:
>> * Efficient, reliable event multicasting. By using hard-links as
>> described above, the event only needs to be written out once, and
>> the OS only needs to store one copy.
> That's a good mechanism; you're already fulfilling that contract
> with the non-eventfs implementation.
>> * Automatic multicast channel cleanup. Eventfs would ensure that no
>> matter how a process dies, its multicast state would become
>> inaccessible and be reclaimed once it is dead (i.e. a subsequent
>> filesystem operation on the orphaned state, no matter how soon after
>> the process's exit, will fail).
> That's where storing events as files is problematic: files survive
> processes. But I still don't think a specific fs is necessary: you can
> either ensure files do not survive processes (see the supervisor/cleaner
> idea above), or you can use another Unix mechanism (see below).
>> Both of the above are implicitly guaranteed by libudev, since it
>> relies on a netlink multicast group shared with the udevd process
>> to achieve them.
> And honestly, that's not a bad design. If you want to have multicast,
> and you happen to have a true multicast IPC mechanism, might as well
> use it. It will be hard to be as efficient as that: if you don't have
> true multicast, you have to compromise somewhere.
> I dare say using a netlink multicast group is lighter than designing
> a FUSE filesystem to do the same thing. If you want the same
> functionality, why didn't you adopt the same mechanism ?

I agree that netlink is lighter, but I avoided it for two reasons:
* Sometime down the road, I'd like to port vdev to OpenBSD. Not because I
believe that the OpenBSD project is in dire need of a dynamic device
manager, but simply because it's the thing I miss the most when I'm using
OpenBSD (personal preference). Netlink is Linux-specific, whereas FUSE
works on pretty much every Unix these days.
* There is no way to namespace netlink messages that I'm aware of. The
kernel (and udev) sends the same device events to every container on the
system--in fact, this is one of the major reasons cited by the systemd
folks for moving off of netlink for udevd-to-libudev communications. By
using a synthetic filesystem for message transport, I can use bind-mounts
to control which device events get routed to which containers (this is also
the reason why the late kdbus was implemented as a synthetic filesystem).
Using fifodirs has the same benefit :)

> (It can be made modular. You can have a uevent listener that just gets
> the event from the kernel and transmits it to the event manager; and
> the chosen event manager multicasts it.)

Good point; something I'll keep in mind in the future evolution of
libudev-compat :)

>> It is my understanding (please correct me if I'm wrong) that with
>> s6-ftrig-*, I would need to write out the event data to each
>> listener's pipe (i.e. once per struct udev_monitor instance), and I
>> would still be responsible for cleaning up the fifodir every now and
>> then if the libudev-compat client failed to do so itself. Is my
>> understanding correct?
> Yes and no. I'm not suggesting that you use libftrig for your purpose. :)
> * My concern with libftrig was never event storage: it was
> many-to-many notification. I didn't design it to transmit arbitrary
> amounts of data, but to instantly wake up processes when something
> happens; data transmission *is* possible, but the original idea is
> to send one byte at a time, for just 256 types of event.
> Notification and data transmission are orthogonal concepts. It's
> always possible to store data somewhere and notify processes that
> data is available; then processes can fetch the data. Data
> transmission can be pull, whereas notification has to be push.
> libftrig is only about the push.
> Leaking space is not a concern with libftrig, because fifodirs
> never store data, only pipes; at worst, they leak a few inodes.
> That is why a polling cleaner is sufficient: even if multiple
> subscribers get SIGKILLed, they will only leave behind a few
> fifos, and no data - so sweeping now and then is more than enough.
> It's different if you're storing data, because leaks can be much
> more problematic.
> * Unless you have true multicast, you will have to push a
> notification as many times as you have listeners, no matter what.
> That's what I'm doing when writing to all the fifos in a fifodir.
> That's what you are doing when linking the event into every
> subscriber's directory. I guess your subscriber library uses some
> kind of inotify to know when a new file has arrived?

Yes, modulo some other mechanisms to ensure that the libudev-compat process
doesn't get back-logged and lose messages. I completely agree with you
about the benefits of separating notification (control-plane) from message
delivery (data-plane).
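
A minimal sketch of the hard-link delivery discussed in this thread is given below, assuming a staging directory and one directory per subscriber, all on the same filesystem (hard links cannot cross filesystems). The directory layout and the function name are illustrative assumptions, not libudev-compat's actual code.

```python
import os


def publish(event: bytes, staging_dir: str, subscriber_dirs: list, name: str) -> int:
    """Write the event once, then hard-link it into every subscriber directory."""
    staged = os.path.join(staging_dir, name)
    with open(staged, "wb") as f:
        f.write(event)
    # One link() per subscriber: O(subscribers) work, but only one stored copy.
    for directory in subscriber_dirs:
        os.link(staged, os.path.join(directory, name))
    # Drop the staging name; the data lives on until the last link is removed.
    os.unlink(staged)
    return len(subscriber_dirs)
```

Each subscriber would watch its own directory (for example with inotify, which is not shown here because the Python standard library has no binding for it), read the new file, and unlink it. The storage is reclaimed when the last subscriber unlinks its name, which is exactly the step that goes missing when a subscriber dies unexpectedly.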

>> Again, I would love to know of a simpler approach that is just as
>> robust.
> Whenever you have "pull" data transmission, you necessarily have the
> problem of storage lifetime. Here, as often, what you want is
> reference counting: when the last handle to the data disappears, the data
> is automatically collected.
> The problem is that your current handle, an inode, is not tied to the
> subscriber's lifetime. You want a type of handle that will die with the
> process.
> File descriptors fit this.
> So, an idea would be to do something like:
> - Your event manager listens to a Unix domain socket.
> - Your subscribers connect to that socket.
> - For every event:
> + the event manager stores the event into an anonymous file (e.g. a file
> in a tmpfs that is unlinked as soon as it is created) while keeping a
> reading fd on it
> + the event manager sends a copy of the reading fd, via fd-passing,
> to every subscriber. This counts as a notification, since it will wake up
> subscribers.
> + the event manager closes its own fd to the file.
> + subscribers will read the fd when they so choose, and they will
> close it afterwards. The kernel will also close it when they die, so you
> won't leak any data.
> Of course, at that point, you may as well give up and just push the
> whole event over the Unix socket. It's what udevd does, except it uses a
> netlink multicast group instead of a normal socket (so its complexity is
> independent from the number of subscribers). Honestly, given that the
> number of subscribers will likely be small, and your events probably aren't
> too large either, it's the simplest design - it's what I'd go for.
> (I even already have the daemon to do it, as a part of skabus. Sending
> data to subscribers is exactly what a pubsub does.)
> But if you estimate that the amount of data is too large and you don't
> want to copy it, then you can just send a fd instead. It's still
> manual broadcast, but it's not in O(event length * subscribers), it's in
> O(subscribers), i.e. the same complexity as your "hard link the event
> file" strategy; and it has the exact storage properties that you want.
> What do you think ?

I think both approaches are good ideas and would work just as well. I
really like skabus's approach--I'll take a look at using it for message
delivery as an additional (preferred?) vdev-to-libudev-compat message
delivery mechanism :) It looks like it offers all the aforementioned
benefits over netlink that I'm looking for.
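
The fd-passing variant quoted above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python 3.9+ (socket.send_fds and socket.recv_fds), already-connected Unix domain sockets, and made-up function names. It is not skabus's or libudev-compat's implementation.

```python
import os
import socket
import tempfile


def send_event(event: bytes, subscribers, directory=None):
    """Store the event in an anonymous file and pass a read-only fd to each subscriber."""
    wfd, path = tempfile.mkstemp(dir=directory)
    rfd = os.open(path, os.O_RDONLY)
    os.unlink(path)  # the file is now anonymous: only open fds keep it alive
    try:
        with os.fdopen(wfd, "wb") as writer:
            writer.write(event)
        for sock in subscribers:
            # The one-byte payload is the notification; the fd rides along as SCM_RIGHTS.
            socket.send_fds(sock, [b"E"], [rfd])
    finally:
        os.close(rfd)  # the manager drops its own reference


def receive_event(sock) -> bytes:
    """Receive one passed fd, read the whole event, and close the fd."""
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    fd = fds[0]
    try:
        size = os.fstat(fd).st_size
        # pread() ignores the file offset, which all passed copies of the fd share.
        return os.pread(fd, size, 0)
    finally:
        os.close(fd)
```

Because the kernel closes a dead subscriber's descriptors for it, the file's storage is released as soon as the last subscriber has closed its fd or exited, with no cleaner process involved.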

A question on the implementation--what do you think of having each
subscriber create its own Unix domain socket in a canonical directory, and
having the sender connect as a client to each subscriber? Since each
subscriber needs its own fd to read and close, the directory of subscriber
sockets automatically gives the sender a list of who to communicate with
and a count of how many fds to create. It also makes it easy to detect and
clean up a dead subscriber's socket: the sender can request a struct ucred
from a subscriber to get its PID (and then other details from /proc), and
if the process ever exits (which the sender can detect on Linux using a
netlink process monitor, like [1]), the process that created the socket can
be assumed to be dead and the sender can unlink it. The sender would rely
on additional process instance-identifying information from /proc (like its
start-time) to avoid PID-reuse races.
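
To make the PID-reuse check concrete, here is a rough, Linux-only sketch. It assumes the sender holds a connected Unix domain socket to the subscriber; SO_PEERCRED returns the peer's struct ucred, and field 22 of /proc/<pid>/stat is the process start time in clock ticks since boot. The function names are illustrative.

```python
import socket
import struct

UCRED = struct.Struct("iII")  # struct ucred: pid_t pid; uid_t uid; gid_t gid


def proc_start_time(pid: int) -> int:
    """Return the start time (field 22 of /proc/<pid>/stat) of a process."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name (field 2) may contain spaces or ')', so split after the last ')'.
    rest = stat.rsplit(")", 1)[1].split()
    return int(rest[19])  # rest[0] is field 3, so field 22 is rest[19]


def peer_identity(sock):
    """Return (pid, start_time) for the process at the other end of sock."""
    creds = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, UCRED.size)
    pid, _uid, _gid = UCRED.unpack(creds)
    return pid, proc_start_time(pid)


def still_same_process(pid: int, start_time: int) -> bool:
    """True if pid still refers to the process recorded earlier."""
    try:
        return proc_start_time(pid) == start_time
    except (FileNotFoundError, ProcessLookupError):
        return False
```

The sender would record the (pid, start_time) pair when it first connects to a subscriber's socket, and unlink that socket only once still_same_process() reports False for the pair.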

Thanks again for all your input!


Received on Thu Jan 14 2016 - 00:55:57 UTC