Mail archive
alpine-devel

Re: [alpine-devel] udev replacement on Alpine Linux

From: Jude Nelson <judecn_at_gmail.com>
Date: Sat, 16 Jan 2016 12:48:10 -0500

Hi Laurent, apologies for the delay,

On Thu, Jan 14, 2016 at 6:36 AM, Laurent Bercot <ska-devel_at_skarnet.org>
wrote:

> On 14/01/2016 06:55, Jude Nelson wrote:
>
>> I think you're close. The gist of it is that vdev needs to supply a
>> lot more information than the kernel gives it. In particular, its
>> helper programs go on to query the properties and status of each
>> device (this often requires root privileges, i.e. via privileged
>> ioctl()s), and vdev gathers the information into a (much larger)
>> event packet and stores it in a directory tree under /dev for
>> subsequent query by less-privileged programs.
>>
>
> I see.
> I think this is exactly what could be made modular. I've heard
> people say they were reluctant to use vdev because it's not KISS, and
> I suspect the ioctl machinery and data gathering is a large part of
> the complexity. If that part could be pluggable, i.e. if admins could
> choose a "data gatherer" just complex enough for their needs, I believe
> it could encourage adoption. In other words, I'm looking at a 3-part
> program:
> - the netlink listener
> - the data gatherer
> - the event publisher


> Of course, for libudev to work, you would need the full data gatherer;
> but if people aren't using libudev programs, they can use a simpler one,
> closer to what mdev is doing.

> It's all from a very high point-of-view, and I don't know the details of
> the code so I have no idea whether it's envisionable for vdev, but that's
> what I'm thinking off the top of my head.


This sounds reasonable. In fact, within vdevd there are already distinct
netlink listener and data gatherer threads that communicate over a
producer/consumer queue. Splitting them into separate processes connected
by a pipe is consistent with the current design, and would also help with
portability.
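
To sketch what the first stage of that split might look like (illustrative
only, not vdevd's current code; event framing and the privileged queries
are elided), the listener could be little more than:

    /* Minimal stand-alone netlink uevent listener: receives raw kernel
     * uevents and forwards them to a separate data-gatherer process over
     * a pipe (stdout), e.g. listener | gatherer | publisher (hypothetical
     * stage names).  Real uevents contain embedded NULs, so an actual
     * listener would need to frame each event before writing it. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>

    int main(void) {
        char buf[8192];
        struct sockaddr_nl addr;
        int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        memset(&addr, 0, sizeof(addr));
        addr.nl_family = AF_NETLINK;
        addr.nl_groups = 1;   /* the kernel's uevent multicast group */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            return 1;
        }

        for (;;) {
            ssize_t len = recv(fd, buf, sizeof(buf), 0);
            if (len > 0)
                write(STDOUT_FILENO, buf, (size_t)len);
        }
    }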


>
>
>
> Funny you mention this--I also created runfs
>> (https://github.com/jcnelson/runfs) to do exactly this. In
>> particular, I use it for PID files.
>>
>
> I have no love for mechanisms that help people keep using PID files,
> which are an ugly relic that can't end up in the museum of mediaeval
> programming soon enough. :P
>

Haha, true. I have other purposes for it though.

> That said, runfs is interesting, and I would love it if Unix provided
> such a mechanism. Unfortunately, for now it has to rely on FUSE, which
> is one of the most clunky mutant features of Linux, and an extra layer
> of complexity; so I find it cleaner if a program can achieve its
> functionality without depending on such a filesystem.
>
>
I think this is one of the things Plan 9 got right--letting a process
expose whatever fate-sharing state it wanted through the VFS. I agree that
using FUSE to do this is a lot clunkier, but I don't think that's FUSE's
fault. As far as I know, Linux doesn't allow a process to expose custom
state through /proc.


>
> I agree that netlink is lighter, but I avoided it for two reasons:
>> * Sometime down the road, I'd like to port vdev to OpenBSD.
>>
>
> That's a good reason, and an additional reason to separate the
> netlink listener from the event publisher (and the data gatherer).
> The event publisher and client library can be made 100% portable,
> whereas the netlink listener and data gatherer obviously cannot.
>
>
> * There is no way to namespace netlink messages that I'm aware of.
>>
>
> I didn't know that - I'm no netlink expert. But that's also a good
> reason. AFAICT, there are 32 netlink multicast groups, and they use
> hardcoded numbers - this is ugly, or at least requires a global
> registry of what each group is used for. If you can't namespace them, it
> becomes even more of a scarce resource; although it's legitimate to
> use one for uevent publishing, I'm pretty sure people will find a way
> to clog them with random crap very soon - better stay away from
> resources you can't reliably lock. And from what you're saying, even
> systemd people have realized that. :)
>
> I'm not advocating netlink use for anything else than reading kernel
> events. It's just that true multicast will be more efficient than manual
> broadcast, there's no way around it.
>
>
> By using a synthetic filesystem for
>> message transport, I can use bind-mounts to control which device
>> events get routed to which containers
>>
>
> I'm torn between "oooh, clever" and "omg this hack is atrocious". :)
>
>
Haha, thanks :)


>
> Yes, modulo some other mechanisms to ensure that the libudev-compat
>> process doesn't get back-logged and lose messages.
>>
>
> What do you mean by that?
> If libudev-compat is, like libudev, linked into the application, then
> you have no control over client behaviour; if a client doesn't properly
> act on a notification, then there's nothing you can do about it and
> it's not your responsibility. Can you give a few details about what
> you're doing client-side?
>
>
A bit of background:
* Unlike a netlink socket, a program cannot control the size of an inotify
descriptor's "receive" buffer. That limit is system-wide, set in
/proc/sys/fs/inotify/max_queued_events. However, libudev offers clients
exactly this ability (via udev_monitor_set_receive_buffer_size).
This is what I originally meant--libudev-compat needs to ensure that the
desired receive buffer size is honored.
* libudev's API exposes the udev_monitor's netlink socket descriptor
directly to the client, so the client can poll on it (via
udev_monitor_get_fd).
* libudev allows clients to define event filters, so they receive only the
events they want (via udev_monitor_filter_*). The implementation achieves
this by translating the filters into BPF programs and attaching them to
the client's netlink socket. It is also somewhat complex, and I didn't
want to have to rewrite it each time I synced the code with upstream.
(A minimal example of a client exercising these calls follows below.)
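
For reference, the kind of client loop that exercises all three of these
entry points looks roughly like this (standard libudev calls only, nothing
vdev-specific; error handling abbreviated):

    /* Typical libudev monitor usage that libudev-compat must keep working
     * unchanged: a subsystem filter, a custom receive-buffer size, and a
     * pollable descriptor obtained from the monitor. */
    #include <libudev.h>
    #include <poll.h>
    #include <stdio.h>

    int main(void) {
        struct udev *udev = udev_new();
        struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");

        udev_monitor_filter_add_match_subsystem_devtype(mon, "block", NULL);
        udev_monitor_set_receive_buffer_size(mon, 128 * 1024);
        udev_monitor_enable_receiving(mon);

        struct pollfd pfd = { .fd = udev_monitor_get_fd(mon), .events = POLLIN };
        while (poll(&pfd, 1, -1) > 0) {
            struct udev_device *dev = udev_monitor_receive_device(mon);
            if (!dev)
                continue;
            printf("%s %s\n", udev_device_get_action(dev),
                   udev_device_get_syspath(dev));
            udev_device_unref(dev);
        }

        udev_monitor_unref(mon);
        udev_unref(udev);
        return 0;
    }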

To work around these constraints, libudev-compat routes a udev_monitor's
events through an internal socket pair. It uses inotify as an edge-trigger
instead of a level-trigger: when there is at least one file to consume
from the event directory, it reads as many files as it can and tries to
saturate the udev_monitor's socket pair (whose capacity in bytes is now
what udev_monitor_set_receive_buffer_size controls). The receive end of
the socket pair and the inotify descriptor are unified behind a single
pollable epoll descriptor, which libudev-compat's udev_monitor_get_fd
returns; it polls as ready if either there are unconsumed events in the
socket pair or a new file has arrived in the directory. The filtering
implementation works almost unmodified, except that it attaches the BPF
programs to the receiving end of the udev_monitor's socket pair instead
of to a netlink socket.
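
To make that concrete, here is a rough sketch of the plumbing (illustrative
only: the struct and function names below are made up, and error handling
is omitted):

    /* Sketch of the libudev-compat monitor plumbing described above: an
     * inotify watch on the monitor's event directory plus the receive end
     * of a socket pair, unified behind a single epoll descriptor that
     * udev_monitor_get_fd() can hand back to the client. */
    #include <sys/epoll.h>
    #include <sys/inotify.h>
    #include <sys/socket.h>

    struct compat_monitor {          /* hypothetical name */
        int sock[2];       /* [0]: library pushes events, [1]: client reads */
        int inotify_fd;    /* watches the monitor's event directory */
        int epoll_fd;      /* the single fd handed back to the client */
    };

    static int compat_monitor_init(struct compat_monitor *m,
                                   const char *event_dir) {
        struct epoll_event ev = { .events = EPOLLIN };

        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, m->sock) < 0)
            return -1;

        m->inotify_fd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
        inotify_add_watch(m->inotify_fd, event_dir, IN_CREATE | IN_MOVED_TO);

        m->epoll_fd = epoll_create1(EPOLL_CLOEXEC);

        /* readiness of either descriptor wakes the client's poll() */
        ev.data.fd = m->sock[1];
        epoll_ctl(m->epoll_fd, EPOLL_CTL_ADD, m->sock[1], &ev);
        ev.data.fd = m->inotify_fd;
        epoll_ctl(m->epoll_fd, EPOLL_CTL_ADD, m->inotify_fd, &ev);

        return 0;   /* udev_monitor_get_fd() would return m->epoll_fd */
    }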

In summary, the system doesn't try to outright prevent event loss for
clients; it tries to ensure that clients can control their receive-buffer
size, with the expected results. One of the more subtle reasons for using
eventfs is that it makes it possible to cap the number of bytes an event
directory can hold. Because the cap is per-directory, the system can
control, on a per-monitor basis, the maximum number of events it will hold
before NACKing the event-pusher. udev_monitor_set_receive_buffer_size
would then also set the byte limit on its udev_monitor's event directory,
thereby retaining the original API contract.
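
A sketch of how the compat library might honor that, assuming eventfs
exposes some per-directory limit attribute (the ".limit" path below is
purely hypothetical):

    /* Sketch only: apply the requested receive-buffer size to both the
     * monitor's socket pair and its event directory.  The ".limit" file
     * is a made-up stand-in for whatever knob eventfs actually exposes
     * for the per-directory byte cap. */
    #include <stdio.h>
    #include <sys/socket.h>

    static int compat_set_receive_buffer_size(int sock_rx_fd,
                                              const char *event_dir,
                                              int size) {
        char path[4096];
        FILE *f;

        /* cap the socket pair's receive buffer, as SO_RCVBUF would on netlink */
        if (setsockopt(sock_rx_fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
            return -1;

        /* cap the event directory's total size (hypothetical eventfs attribute) */
        snprintf(path, sizeof(path), "%s/.limit", event_dir);
        f = fopen(path, "w");
        if (f) {
            fprintf(f, "%d\n", size);
            fclose(f);
        }
        return 0;
    }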


>
> I think both approaches are good ideas and would work just as well.
>> I really like skabus's approach--I'll take a look at using it for
>> message delivery as an additional (preferred?) vdev-to-libudev-compat
>> message delivery mechanism :) It looks like it offers all the
>> aforementioned benefits over netlink that I'm looking for.
>>
>
> Unfortunately, it's not published yet, because there's still a lot
> of work to be done on clients. And now I'm wondering whether it would
> be more efficient to store messages in anonymous files and transmit
> fds, instead of transmitting copies of messages. I may have to rewrite
> stuff. :)
> I think I'll be able to get back to work on skabus by the end of this
> year - but no promises, since I'll be working on the Alpine init system
> as soon as I'm done with my current contract. But I can leak a few
> pieces of source code if you're interested.
>
>
I'd be willing to take a crack at it, if I have time between now and the
end of the year. I'm trying to finish my PhD this year, which is why vdev
development has been slow-going for the past several months. Will keep you
posted :)


>
> A question on the implementation--what do you think of having each
>> subscriber create its own Unix domain socket in a canonical
>> directory, and having the sender connect as a client to each
>> subscriber?
>>
>
> That's exactly how fifodirs work, with pipes instead of sockets.
> But I don't think that's a good fit here.
>
> A point of fifodirs is to have many-to-many communication: there
> are several subscribers, but there can also be several publishers
> (even if in practice there's often only one publisher). Publishers and
> subscribers are completely independent.
> Here, you only ever have one publisher: the event dispatcher. You
> only ever need one-to-many communication.
>
> Another point of fifodirs is to avoid the need for a daemon to act
> as a bus. It's notification that happens between unrelated processes
> without requiring a central server to ensure the communication.
> It's important because I didn't want my supervision system (which is
> supposed to manage daemons) to itself rely on a daemon (which would
> then have to be unsupervised).
> Here, you don't have that requirement, and you already have a daemon:
> the event dispatcher is long-lived.

> I think a "socketdir" mechanism is just too heavy:
> - for every event, you perform opendir(), readdir() and closedir()
> - for every event * subscriber, you perform at least socket(), connect(),
> sendmsg() and close()
> - the client library needs to listen() and accept(), which means it
> needs its own thread (and I hate, hate, hate, libraries that pull in
> thread support in my otherwise single-threaded programs)
> - the client library needs to perform access control on the socket,
> to avoid connects from unrelated processes, and even then you can't
> be certain it's the event publisher and not a random root process
>
> You definitely don't want a client library to be listen()ing.
> listen() is server stuff - mixing client and server stuff is complex.
> Too much so for what you need here.


> Since each subscriber needs its own fd to read and
>> close, the directory of subscriber sockets automatically gives the
>> sender a list of who to communicate with and a count of how many fds
>> to create. It also makes it easy to detect and clean up a dead
>> subscriber's socket: the sender can request a struct ucred from a
>> subscriber to get its PID (and then other details from /proc), and if
>> the process ever exits (which the sender can detect on Linux using a
>> netlink process monitor, like [1]), the process that created the
>> socket can be assumed to be dead and the sender can unlink it. The
>> sender would rely on additional process instance-identifying
>> information from /proc (like its start-time) to avoid PID-reuse
>> races.
>>
>
> Bleh. Of course it can be made to work, but you really don't need all
> that complexity. You have a daemon that wants to publish data, and
> several clients that want to receive data from that daemon: it's
> one (long-lived) to many (short-lived) communication, and there's a
> perfectly appropriate, simple and portable IPC for that: a single Unix
> domain socket that your daemon listens on and your clients connect to.
> If you want to be perfectly reliable, you can implement some kind of
> autoreconnect in the client library - in case you want to restart the
> event publisher without killing X, for instance. But that's still a
> lot simpler than playing with multiple sockets and mixing clients and
> servers when you don't need to.


Agreed--if the event dispatcher is going to be a message bus, then a lot of
the aforementioned difficulties can be eliminated by design. But I'm
uncomfortable with the technical debt it can introduce to the
ecosystem--for example, a message bus has its own semantics that
effectively require a bus-specific library, clients' design choices can
require a message bus daemon to be running at all times, pervasive use of
the message bus by system-level software can make the implementation a hard
requirement for having a usable system, etc. (in short, we get dbus
again). By going with a filesystem-oriented approach, this risk is averted,
since the filesystem interface is well-understood, universally supported,
and somewhat future-proof. Most programs can use it without being aware of
the fact.


>
>
>
> Thanks again for all your input!
>>
>
> No problem. I love design discussions, I can't get enough of them.
> (The reason why I left the Devuan mailing-list is that there was too
> much ideological mumbo-jumbo, and not enough technical/design stuff.
> Speaking of which, my apologies to Alpine devs for hijacking their ML;
> if it's too OT/uninteresting, we'll take the discussion elsewhere.)


Happy to move offline, unless the Alpine devs still want to be CC'ed :)

-Jude


---
Unsubscribe:  alpine-devel+unsubscribe_at_lists.alpinelinux.org
Help:         alpine-devel+help_at_lists.alpinelinux.org
---
Received on Sat Jan 16 2016 - 12:48:10 GMT