Subject: Re: [alpine-devel] udev replacement on Alpine Linux
To: alpine-devel@lists.alpinelinux.org
From: Laurent Bercot
Message-ID: <56978822.8020205@skarnet.org>
Date: Thu, 14 Jan 2016 12:36:02 +0100

On 14/01/2016 06:55, Jude Nelson wrote:
> I think you're close. The gist of it is that vdev needs to supply a
> lot more information than the kernel gives it. In particular, its
> helper programs go on to query the properties and status of each
> device (this often requires root privileges, i.e. via privileged
> ioctl()s), and vdev gathers the information into a (much larger)
> event packet and stores it in a directory tree under /dev for
> subsequent query by less-privileged programs.

I see. I think this is exactly what could be made modular. I've heard
people say they were reluctant to use vdev because it's not KISS, and I
suspect the ioctl machinery and data gathering are a large part of the
complexity. If that part could be pluggable, i.e. if admins could choose
a "data gatherer" just complex enough for their needs, I believe it could
encourage adoption.

In other words, I'm looking at a 3-part program:
- the netlink listener
- the data gatherer
- the event publisher

Of course, for libudev to work, you would need the full data gatherer;
but if people aren't using libudev programs, they can use a simpler one,
closer to what mdev is doing.

This is all from a very high-level point of view, and I don't know the
details of the code, so I have no idea whether it's feasible for vdev,
but that's what I'm thinking off the top of my head.
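To make that split a bit more concrete, here is a rough sketch of what
the netlink-listener part alone could look like. It assumes Linux and the
NETLINK_KOBJECT_UEVENT family; the gathering and publishing steps are only
placeholder comments here, not actual vdev code:

  /* netlink listener sketch: receive raw kernel uevents and hand them off */
  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/netlink.h>

  int main (void)
  {
    struct sockaddr_nl sa = { .nl_family = AF_NETLINK, .nl_groups = 1 } ;  /* kernel uevent group */
    char buf[8192] ;
    int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT) ;
    if (fd < 0) return 111 ;
    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0) return 111 ;
    for (;;)
    {
      ssize_t r = recv(fd, buf, sizeof buf - 1, 0) ;
      if (r <= 0) continue ;
      buf[r] = 0 ;
      /* buf holds "ACTION@DEVPATH\0KEY=VALUE\0..." ; this is where a pluggable
         data gatherer would add its information and where the event publisher
         would take over. Here we just print the summary line. */
      printf("uevent: %s\n", buf) ;
    }
  }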
> Funny you mention this--I also created runfs
> (https://github.com/jcnelson/runfs) to do exactly this. In
> particular, I use it for PID files.

I have no love for mechanisms that help people keep using PID files, which
are an ugly relic that can't end up in the museum of mediaeval programming
soon enough. :P
That said, runfs is interesting, and I would love it if Unix provided such
a mechanism. Unfortunately, for now it has to rely on FUSE, which is one of
the clunkiest mutant features of Linux, and an extra layer of complexity;
so I find it cleaner if a program can achieve its functionality without
depending on such a filesystem.

> I agree that netlink is lighter, but I avoided it for two reasons:
> * Sometime down the road, I'd like to port vdev to OpenBSD.

That's a good reason, and an additional reason to separate the netlink
listener from the event publisher (and the data gatherer). The event
publisher and client library can be made 100% portable, whereas the
netlink listener and data gatherer obviously cannot.

> * There is no way to namespace netlink messages that I'm aware of.

I didn't know that - I'm no netlink expert. But that's also a good reason.
AFAICT, there are 32 netlink multicast groups, and they use hardcoded
numbers - this is ugly, or at least requires a global registry of what each
group is used for. If you can't namespace them, they become even more of a
scarce resource; although it's legitimate to use one for uevent publishing,
I'm pretty sure people will find a way to clog them with random crap very
soon - better to stay away from resources you can't reliably lock. And from
what you're saying, even the systemd people have realized that. :)
I'm not advocating netlink use for anything other than reading kernel
events. It's just that true multicast will always be more efficient than
manual broadcast, there's no way around it.

> By using a synthetic filesystem for
> message transport, I can use bind-mounts to control which device
> events get routed to which containers

I'm torn between "oooh, clever" and "omg this hack is atrocious". :)

> Yes, modulo some other mechanisms to ensure that the libudev-compat
> process doesn't get back-logged and lose messages.

What do you mean by that? If libudev-compat is, like libudev, linked into
the application, then you have no control over client behaviour; if a
client doesn't properly act on a notification, there's nothing you can do
about it and it's not your responsibility. Can you give a few details
about what you're doing client-side?

> I think both approaches are good ideas and would work just as well.
> I really like skabus's approach--I'll take a look at using it for
> message delivery as an additional (preferred?) vdev-to-libudev-compat
> message delivery mechanism :) It looks like it offers all the
> aforementioned benefits over netlink that I'm looking for.

Unfortunately, it's not published yet, because there's still a lot of work
to be done on clients. And now I'm wondering whether it would be more
efficient to store messages in anonymous files and transmit fds, instead
of transmitting copies of messages. I may have to rewrite stuff. :)
I think I'll be able to get back to work on skabus by the end of this year
- but no promises, since I'll be working on the Alpine init system as soon
as I'm done with my current contract. But I can leak a few pieces of
source code if you're interested.
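Just to illustrate the anonymous-file idea, here is a sketch of what
passing a message as a fd could look like, assuming Linux's memfd_create()
and an already-connected Unix domain socket - send_event_fd() is an
illustrative name, not skabus code:

  /* Store one message in an anonymous file, then pass the fd as ancillary
     data; every subscriber then reads the same copy instead of getting its own. */
  #define _GNU_SOURCE
  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <sys/uio.h>
  #include <sys/mman.h>

  int send_event_fd (int unixfd, char const *msg, size_t len)
  {
    int mfd = memfd_create("event", MFD_CLOEXEC) ;  /* anonymous, unlinked file */
    if (mfd < 0) return -1 ;
    if (write(mfd, msg, len) != (ssize_t)len) { close(mfd) ; return -1 ; }

    char dummy = 'x' ;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 } ;
    union { char buf[CMSG_SPACE(sizeof(int))] ; struct cmsghdr align ; } cbuf ;
    struct msghdr mh = { .msg_iov = &iov, .msg_iovlen = 1,
                         .msg_control = cbuf.buf, .msg_controllen = sizeof cbuf.buf } ;
    struct cmsghdr *cm = CMSG_FIRSTHDR(&mh) ;
    cm->cmsg_level = SOL_SOCKET ;
    cm->cmsg_type = SCM_RIGHTS ;
    cm->cmsg_len = CMSG_LEN(sizeof(int)) ;
    memcpy(CMSG_DATA(cm), &mfd, sizeof(int)) ;

    int r = sendmsg(unixfd, &mh, 0) < 0 ? -1 : 0 ;
    close(mfd) ;  /* the receiver's fd keeps the file alive */
    return r ;
  }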
> A question on the implementation--what do you think of having each
> subscriber create its own Unix domain socket in a canonical
> directory, and having the sender connect as a client to each
> subscriber?

That's exactly how fifodirs work, with pipes instead of sockets.
But I don't think that's a good fit here.
A point of fifodirs is to have many-to-many communication: there are
several subscribers, but there can also be several publishers (even if in
practice there's often only one publisher). Publishers and subscribers are
completely independent. Here, you only ever have one publisher: the event
dispatcher. You only ever need one-to-many communication.
Another point of fifodirs is to avoid the need for a daemon to act as a
bus: it's notification between unrelated processes that doesn't require a
central server to relay the communication. That was important because I
didn't want my supervision system (which is supposed to manage daemons) to
itself rely on a daemon (which would then have to be unsupervised). Here,
you don't have that requirement, and you already have a daemon: the event
dispatcher is long-lived.

I think a "socketdir" mechanism is just too heavy:
- for every event, you perform opendir(), readdir() and closedir()
- for every event × subscriber pair, you perform at least socket(),
  connect(), sendmsg() and close()
- the client library needs to listen() and accept(), which means it needs
  its own thread (and I hate, hate, hate libraries that pull in thread
  support in my otherwise single-threaded programs)
- the client library needs to perform access control on the socket, to
  avoid connects from unrelated processes, and even then you can't be
  certain it's the event publisher and not a random root process

You definitely don't want a client library to be listen()ing. listen() is
server stuff - mixing client and server stuff is complex. Too much so for
what you need here.

> Since each subscriber needs its own fd to read and
> close, the directory of subscriber sockets automatically gives the
> sender a list of who to communicate with and a count of how many fds
> to create. It also makes it easy to detect and clean up a dead
> subscriber's socket: the sender can request a struct ucred from a
> subscriber to get its PID (and then other details from /proc), and if
> the process ever exits (which the sender can detect on Linux using a
> netlink process monitor, like [1]), the process that created the
> socket can be assumed to be dead and the sender can unlink it. The
> sender would rely on additional process instance-identifying
> information from /proc (like its start-time) to avoid PID-reuse
> races.

Bleh. Of course it can be made to work, but you really don't need all that
complexity. You have a daemon that wants to publish data, and several
clients that want to receive data from that daemon: it's one (long-lived)
to many (short-lived) communication, and there's a perfectly appropriate,
simple and portable IPC for that: a single Unix domain socket that your
daemon listens on and your clients connect to.
If you want to be perfectly reliable, you can implement some kind of
autoreconnect in the client library - in case you want to restart the
event publisher without killing X, for instance. But that's still a lot
simpler than playing with multiple sockets and mixing clients and servers
when you don't need to.
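For reference, the kind of thing I have in mind is about as simple as the
sketch below, assuming POSIX/Linux sockets; the names and the fixed-size
subscriber table are illustrative, not actual skabus or vdev code:

  /* One listening Unix socket; accept() registers subscribers, and every
     event is written to all of them, dropping dead ones on write error. */
  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <sys/un.h>

  static int subs[64] ;            /* connected subscriber fds */
  static unsigned int nsubs = 0 ;

  int publisher_socket (char const *path)  /* the daemon listens on this... */
  {
    struct sockaddr_un sa = { .sun_family = AF_UNIX } ;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0) ;
    if (fd < 0) return -1 ;
    strncpy(sa.sun_path, path, sizeof sa.sun_path - 1) ;
    unlink(path) ;
    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0 || listen(fd, SOMAXCONN) < 0)
    { close(fd) ; return -1 ; }
    return fd ;
  }

  void publisher_accept (int lfd)  /* ...and clients just connect() to it */
  {
    int fd = accept(lfd, 0, 0) ;
    if (fd < 0) return ;
    if (nsubs < sizeof subs / sizeof subs[0]) subs[nsubs++] = fd ;
    else close(fd) ;
  }

  void publisher_send (char const *event, size_t len)  /* one-to-many broadcast */
  {
    for (unsigned int i = 0 ; i < nsubs ; )
      if (send(subs[i], event, len, MSG_NOSIGNAL) < 0)
      { close(subs[i]) ; subs[i] = subs[--nsubs] ; }
      else i++ ;
  }

The client side is then a single connect(), plus whatever autoreconnect
logic the client library wants to add.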
> Thanks again for all your input!

No problem. I love design discussions, I can't get enough of them.
(The reason why I left the Devuan mailing-list is that there was too much
ideological mumbo-jumbo, and not enough technical/design stuff. Speaking
of which, my apologies to the Alpine devs for hijacking their ML; if it's
too OT/uninteresting, we'll take the discussion elsewhere.)

--
Laurent

---
Unsubscribe: alpine-devel+unsubscribe@lists.alpinelinux.org
Help:        alpine-devel+help@lists.alpinelinux.org
---