Subject: Re: [alpine-devel] udev replacement on Alpine Linux
To: alpine-devel@lists.alpinelinux.org
From: Laurent Bercot
Date: Wed, 13 Jan 2016 13:33:24 +0100
Message-ID: <56964414.1000605@skarnet.org>
References: <20150727103737.4f95e523@ncopa-desktop.alpinelinux.org>
 <20150728052436.GC1923@newbook> <20160112153804.GI32545@example.net>
 <56953ABE.5090203@skarnet.org> <56958E22.90806@skarnet.org>
X-Mailinglist: alpine-devel
List-Id: Alpine Development
On 13/01/2016 04:47, Jude Nelson wrote:
> I haven't tried this myself, but it should be doable. Vdev's
> event-propagation mechanism is a small program that constructs a
> uevent string from environment variables passed to it by vdev and
> writes the string to the appropriate place. The vdev daemon isn't
> aware of its existence; it simply executes it like it would for any
> other matching device-event action. Another device manager could
> supply the same program with the right environment variables and
> use it for the same purposes.

Indeed. My question then becomes: what are the differences between
the string passed by the kernel (which is more or less a list of
environment variables, too) and the string constructed by vdev ? In
other words, is vdev itself more than a trivial netlink listener,
and if yes, what does it do ? (I'll just take a pointer to the
documentation if that question is answered somewhere.)
For now I'll take a wild guess and say that vdev analyzes the
MODALIAS or something, according to a conf file, in order to know the
correct fan-out to perform and write the event to the correct
subsystems. Am I close ?

> Tmpfs and devtmpfs are designed for holding ephemeral state
> already, so I'm not sure why the fact that they expose data as
> regular files is a concern?

Two different meanings of "ephemeral". tmpfs and devtmpfs are
supposed to retain their data until the end of the system's
lifetime. An event is much more ephemeral than that: it's supposed
to be consumed instantly - like the event from the kernel is
consumed instantly by the netlink listener.
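(To make "trivial netlink listener" concrete: as far as I know, the
kernel's uevent datagram is just an "action@devpath" header followed
by NUL-separated KEY=VALUE pairs, so the listener's parsing job is a
few lines. A toy sketch of my own - the payload below is invented
for illustration, it is not vdev output:)

```python
def parse_uevent(payload: bytes) -> dict:
    """Split 'action@devpath\\0KEY=VALUE\\0...' into a dict of
    environment variables, the way a minimal uevent listener would
    after reading a datagram from an AF_NETLINK,
    NETLINK_KOBJECT_UEVENT socket."""
    fields = payload.split(b"\0")
    action, _, devpath = fields[0].decode().partition("@")
    env = {"ACTION": action, "DEVPATH": devpath}
    for f in fields[1:]:
        if b"=" in f:
            key, _, value = f.partition(b"=")
            env[key.decode()] = value.decode()
    return env

# Made-up payload in the kernel's wire format:
raw = (b"add@/devices/pci0000:00/usb1\0"
       b"ACTION=add\0"
       b"DEVPATH=/devices/pci0000:00/usb1\0"
       b"SUBSYSTEM=usb\0"
       b"MODALIAS=usb:v1D6Bp0002\0")
print(parse_uevent(raw)["SUBSYSTEM"])   # usb
```

(Everything vdev does beyond that split - matching rules, running
helpers - would be the part I'm asking about.)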
Files, even in a tmpfs, remain alive in the absence of a live
process to hold them; but events have no meaning if no process needs
them, which is the reason for the "event leaking" problem. Ideally,
you need a file type with basically the same lifetime as a process.
Holding event data in a file is perfectly valid as long as you have
a mechanism to reclaim the file as soon as the last reference to it
dies.

> I couldn't think of a simpler way that was also as robust. Unless
> I'm misunderstanding something, wrapping an arbitrary program to
> clean up the files it created would, in the extreme, require
> coming up with a way to do so on SIGKILL. I'd love to know if
> there is a simple way to do this, though.

That's where supervisors come into play: the parent of a process
always knows when it dies, even on SIGKILL. Supervised daemons can
have a cleaner script in place. For the general case, it shouldn't
be hard to have a wrapper that forks an arbitrary program and cleans
up /dev/metadata/whatever/*$childpid* when it dies. The price to pay
is an additional process, but that additional process would be very
small.
You can still have a polling "catch-all cleaner" to collect dead
events in case the supervisor/wrapper also died, but since that
occurrence will be rare, the polling period can be pretty long, so
it's not a problem.

> I went with a specialized filesystem for two reasons, both of
> which were to fulfill libudev's API contract:
> * Efficient, reliable event multicasting. By using hard links as
> described above, the event only needs to be written out once, and
> the OS only needs to store one copy.

That's a good mechanism; you're already fulfilling that contract
with the non-eventfs implementation.

> * Automatic multicast channel cleanup. Eventfs would ensure that
> no matter how a process dies, its multicast state would become
> inaccessible and be reclaimed once it is dead (i.e. a subsequent
> filesystem operation on the orphaned state, no matter how soon
> after the process's exit, will fail).

That's where storing events as files is problematic: files survive
processes. But I still don't think a specific fs is necessary: you
can either ensure files do not survive processes (see the
supervisor/cleaner idea above), or you can use another Unix
mechanism (see below).

> Both of the above are implicitly guaranteed by libudev, since it
> relies on a netlink multicast group shared with the udevd process
> to achieve them.

And honestly, that's not a bad design. If you want to have
multicast, and you happen to have a true multicast IPC mechanism,
might as well use it. It will be hard to be as efficient as that: if
you don't have true multicast, you have to compromise somewhere. I
dare say using a netlink multicast group is lighter than designing a
FUSE filesystem to do the same thing. If you want the same
functionality, why didn't you adopt the same mechanism ?
(It can be made modular. You can have a uevent listener that just
gets the event from the kernel and transmits it to the event
manager; and the chosen event manager multicasts it.)

> It is my understanding (please correct me if I'm wrong) that with
> s6-ftrig-*, I would need to write out the event data to each
> listener's pipe (i.e. once per struct udev_monitor instance), and
> I would still be responsible for cleaning up the fifodir every now
> and then if the libudev-compat client failed to do so itself. Is
> my understanding correct?

Yes and no. I'm not suggesting you use libftrig for your purpose. :)

* My concern with libftrig was never event storage: it was
many-to-many notification. I didn't design it to transmit arbitrary
amounts of data, but to instantly wake up processes when something
happens; data transmission *is* possible, but the original idea is
to send one byte at a time, for just 256 types of event.
Notification and data transmission are orthogonal concepts.
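(A toy sketch of that push-only idea, in case it helps - anonymous
pipes stand in here for the named fifos a real fifodir holds, so the
example is self-contained:)

```python
import os
import select

EVENT_DEVICE_ADDED = b"a"   # hypothetical event byte, one of up to 256

# Two subscribers, each owning the read end of a pipe.
subscribers = [os.pipe() for _ in range(2)]

# Notifier side: "push" means writing the byte once per listener.
for r, w in subscribers:
    os.write(w, EVENT_DEVICE_ADDED)

# Subscriber side: poll/select wakes us the instant the byte
# arrives; the byte only identifies the event type, and any actual
# event data would be fetched elsewhere (that part is pull).
r0 = subscribers[0][0]
ready, _, _ = select.select([r0], [], [], 0)
assert ready, "notification should be pending"
print(os.read(r0, 1))
```

(Note no data is stored anywhere: if a subscriber is SIGKILLed, all
that can leak is its pipe, not event contents.)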
It's always possible to store data somewhere and notify processes
that data is available; then processes can fetch the data. Data
transmission can be pull, whereas notification has to be push;
libftrig is only about the push.
Leaking space is not a concern with libftrig, because fifodirs never
store data, only pipes; at worst, they leak a few inodes. That is
why a polling cleaner is sufficient: even if multiple subscribers
get SIGKILLed, they will only leave behind a few fifos, and no data
- so sweeping now and then is more than enough. It's different if
you're storing data, because leaks can be much more problematic.

* Unless you have true multicast, you will have to push a
notification as many times as you have listeners, no matter what.
That's what I'm doing when writing to all the fifos in a fifodir.
That's what you are doing when linking the event into every
subscriber's directory. I guess your subscriber library uses some
kind of inotify to know when a new file has arrived?

> Again, I would love to know of a simpler approach that is just as
> robust.

Whenever you have "pull" data transmission, you necessarily have the
problem of storage lifetime. Here, as often, what you want is
reference counting: when the last handle to the data disappears, the
data is automatically collected. The problem is that your current
handle, an inode, is not tied to the subscriber's lifetime. You want
a type of handle that dies with the process. File descriptors fit
this description.

So, an idea would be to do something like:
- Your event manager listens to a Unix domain socket.
- Your subscribers connect to that socket.
- For every event:
  + the event manager stores the event into an anonymous file (e.g.
    a file in a tmpfs that is unlinked as soon as it is created)
    while keeping a reading fd on it;
  + the event manager sends a copy of the reading fd, via
    fd-passing, to every subscriber. This counts as a notification,
    since it will wake up subscribers;
  + the event manager closes its own fd to the file;
  + subscribers will read the fd when they so choose, and they will
    close it afterwards. The kernel will also close it when they
    die, so you won't leak any data.

Of course, at that point, you may as well give up and just push the
whole event over the Unix socket. It's what udevd does, except it
uses a netlink multicast group instead of a normal socket (so its
complexity is independent of the number of subscribers). Honestly,
given that the number of subscribers will likely be small, and your
events probably aren't too large either, it's the simplest design -
it's what I'd go for. (I even already have the daemon to do it, as a
part of skabus. Sending data to subscribers is exactly what a pubsub
does.)
But if you estimate that the amount of data is too large and you
don't want to copy it, then you can just send a fd instead. It's
still manual broadcast, but it's not O(event length * subscribers),
it's O(subscribers), i.e. the same complexity as your "hard link the
event file" strategy; and it has the exact storage properties that
you want.

What do you think ?

--
Laurent

---
Unsubscribe: alpine-devel+unsubscribe@lists.alpinelinux.org
Help: alpine-devel+help@lists.alpinelinux.org
---