Subject: Re: [alpine-devel] udev replacement on Alpine Linux
From: Jude Nelson
To: Laurent Bercot
Cc: alpine-devel@lists.alpinelinux.org
Date: Thu, 14 Jan 2016 00:55:57 -0500

Hi Laurent,

On Wed, Jan 13, 2016 at 7:33 AM, Laurent Bercot wrote:
> On 13/01/2016 04:47, Jude Nelson wrote:
>> I haven't tried this myself, but it should be doable.  Vdev's
>> event-propagation mechanism is a small program that constructs a
>> uevent string from environment variables passed to it by vdev and
>> writes the string to the appropriate place.  The vdev daemon isn't
>> aware of its existence; it simply executes it like it would any
>> other matching device-event action.  Another device manager could
>> supply the same program with the right environment variables and
>> use it for the same purposes.
>
> Indeed. My question then becomes: what are the differences between
> the string passed by the kernel (which is more or less a list of
> environment variables, too) and the string constructed by vdev?
> In other words, is vdev itself more than a trivial netlink listener,
> and if yes, what does it do? (I'll just take a pointer to the
> documentation if that question is answered somewhere.)
> For now I'll take a wild guess and say that vdev analyzes the
> MODALIAS or something, according to a conf file, in order to know
> the correct fan-out to perform and write the event to the correct
> subsystems. Am I close?
> (I should really sit down and write documentation sometime :)

I think you're close.  The gist of it is that vdev needs to supply a
lot more information than the kernel gives it.  In particular, its
helper programs go on to query the properties and status of each
device (this often requires root privileges, e.g. via privileged
ioctl()s), and vdev gathers the information into a (much larger) event
packet and stores it in a directory tree under /dev for subsequent
query by less-privileged programs.  It doesn't rely on the MODALIAS
per se; instead it matches fields of the kernel's uevent packet (one
of which is the MODALIAS) to the right helper programs to run.

Here's an example of what vdev gathers for my laptop's SATA disk:

$ cat /dev/metadata/dev/sda/properties
VDEV_ATA=1
VDEV_WWN=0x5000c500299a9a7a
VDEV_BUS=ata
VDEV_SERIAL=ST9500420AS_5VJ7A0BM
VDEV_SERIAL_SHORT=5VJ7A0BM
VDEV_REVISION=0003LVM1
VDEV_TYPE=ata
VDEV_MAJOR=8
VDEV_MINOR=0
VDEV_OS_SUBSYSTEM=block
VDEV_OS_DEVTYPE=disk
VDEV_OS_DEVPATH=/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda
VDEV_OS_DEVNAME=sda
VDEV_ATA=1
VDEV_ATA_TYPE=disk
VDEV_ATA_MODEL=ST9500420AS
VDEV_ATA_MODEL_ENC=ST9500420ASx20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20
VDEV_ATA_REVISION=0003LVM1
VDEV_ATA_SERIAL=ST9500420AS_5VJ7A0BM
VDEV_ATA_SERIAL_SHORT=5VJ7A0BM
VDEV_ATA_WRITE_CACHE=1
VDEV_ATA_WRITE_CACHE_ENABLED=1
VDEV_ATA_FEATURE_SET_HPA=1
VDEV_ATA_FEATURE_SET_HPA_ENABLED=1
VDEV_ATA_FEATURE_SET_PM=1
VDEV_ATA_FEATURE_SET_PM_ENABLED=1
VDEV_ATA_FEATURE_SET_SECURITY=1
VDEV_ATA_FEATURE_SET_SECURITY_ENABLED=0
VDEV_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN=100
VDEV_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN=100
VDEV_ATA_FEATURE_SET_SECURITY_FROZEN=1
VDEV_ATA_FEATURE_SET_SMART=1
VDEV_ATA_FEATURE_SET_SMART_ENABLED=1
VDEV_ATA_FEATURE_SET_APM=1
VDEV_ATA_FEATURE_SET_APM_ENABLED=1
VDEV_ATA_FEATURE_SET_APM_CURRENT_VALUE=128
VDEV_ATA_DOWNLOAD_MICROCODE=1
VDEV_ATA_SATA=1
VDEV_ATA_SATA_SIGNAL_RATE_GEN2=1
VDEV_ATA_SATA_SIGNAL_RATE_GEN1=1
VDEV_ATA_ROTATION_RATE_RPM=7200
VDEV_ATA_WWN=0x5000c500299a9a7a
VDEV_ATA_WWN_WITH_EXTENSION=0x5000c500299a9a7a

Anything that starts with "VDEV_ATA_", as well as "VDEV_BUS",
"VDEV_SERIAL_*", "VDEV_TYPE", and "VDEV_REVISION", had to be extracted
via an ioctl, by exploring files in sysfs, or by querying a hardware
database.  The kernel only supplied a few of these fields.

>> Tmpfs and devtmpfs are designed for holding ephemeral state already,
>> so I'm not sure why the fact that they expose data as regular files
>> is a concern?
>
> Two different meanings of "ephemeral".
> tmpfs and devtmpfs are supposed to retain their data until the
> end of the system's lifetime. An event is much more ephemeral
> than that: it's supposed to be consumed instantly - like the
> event from the kernel is consumed instantly by the netlink listener.
> Files, even in a tmpfs, remain alive in the absence of a live
> process to hold them; but events have no meaning if no process needs
> them, which is the reason for the "event leaking" problem.
> Ideally, you need a file type with basically the same lifetime
> as a process.
>
> Holding event data in a file is perfectly valid as long as you have
> a mechanism to reclaim the file as soon as the last reference to it
> dies.

Funny you mention this--I also created runfs
(https://github.com/jcnelson/runfs) to do exactly this.  In
particular, I use it for PID files.  Also, eventfs was actually
derived from runfs, but specialized to make it more suitable for
managing event-queues.

>> I couldn't think of a simpler way that was also as robust.
>> Unless I'm misunderstanding something, wrapping an arbitrary
>> program to clean up the files it created would, in the extreme,
>> require coming up with a way to do so on SIGKILL.  I'd love to know
>> if there is a simple way to do this, though.
>
> That's where supervisors come into play: the parent of a process
> always knows when it dies, even on SIGKILL. Supervised daemons can
> have a cleaner script in place.
> For the general case, it shouldn't be hard to have a wrapper that
> forks an arbitrary program and cleans up
> /dev/metadata/whatever/*$childpid* when it dies. The price to pay is
> an additional process, but that additional process would be very
> small.
> You can still have a polling "catch-all cleaner" to collect dead
> events in case the supervisor/wrapper also died, but since that
> occurrence will be rare, the polling period can be pretty long so
> it's not a problem.

Agreed.  I would be happy to keep this approach in mind in the design
of libudev-compat.  Eventfs isn't a hard requirement and I don't want
it to be, since there's more than one way to deal with this problem.

>> I went with a specialized filesystem for two reasons, both of which
>> were to fulfill libudev's API contract:
>> * Efficient, reliable event multicasting.  By using hard-links as
>> described above, the event only needs to be written out once, and
>> the OS only needs to store one copy.
>
> That's a good mechanism; you're already fulfilling that contract
> with the non-eventfs implementation.

>> * Automatic multicast channel cleanup.  Eventfs would ensure that
>> no matter how a process dies, its multicast state would become
>> inaccessible and be reclaimed once it is dead (i.e. a subsequent
>> filesystem operation on the orphaned state, no matter how soon
>> after the process's exit, will fail).
>
> That's where storing events as files is problematic: files survive
> processes. But I still don't think a specific fs is necessary: you
> can either ensure files do not survive processes (see the
> supervisor/cleaner idea above), or you can use another Unix
> mechanism (see below).

>> Both of the above are implicitly guaranteed by libudev, since it
>> relies on a netlink multicast group shared with the udevd process
>> to achieve them.
>
> And honestly, that's not a bad design. If you want to have
> multicast, and you happen to have a true multicast IPC mechanism,
> might as well use it. It will be hard to be as efficient as that: if
> you don't have true multicast, you have to compromise somewhere.
> I dare say using a netlink multicast group is lighter than designing
> a FUSE filesystem to do the same thing. If you want the same
> functionality, why didn't you adopt the same mechanism?

I agree that netlink is lighter, but I avoided it for two reasons:

* Sometime down the road, I'd like to port vdev to OpenBSD.  Not
because I believe that the OpenBSD project is in dire need of a
dynamic device manager, but simply because it's the thing I miss the
most when I'm using OpenBSD (personal preference).  Netlink is
Linux-specific, whereas FUSE works on pretty much every Unix these
days.

* There is no way to namespace netlink messages that I'm aware of.
The kernel (and udev) sends the same device events to every container
on the system--in fact, this is one of the major reasons cited by the
systemd folks for moving off of netlink for udevd-to-libudev
communications.  By using a synthetic filesystem for message
transport, I can use bind-mounts to control which device events get
routed to which containers (this is also the reason why the late
kdbus was implemented as a synthetic filesystem).  Using fifodirs has
the same benefit :)

> (It can be made modular. You can have a uevent listener that just
> gets the event from the kernel and transmits it to the event
> manager; and the chosen event manager multicasts it.)
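(As an aside, the "trivial netlink listener" half of that split really is small: a kernel uevent datagram is an "ACTION@DEVPATH" summary followed by NUL-separated KEY=VALUE pairs, so the listener mostly has to reframe that blob before handing it to whatever does the fan-out. A rough sketch of just the parsing step--illustrative Python, not vdev's actual code:)

```python
def parse_uevent(datagram: bytes) -> dict:
    """Split a kernel uevent datagram into its "ACTION@DEVPATH" header
    and its KEY=VALUE environment -- the same shape a device manager
    fans out to its helper programs."""
    fields = datagram.split(b"\0")
    header = fields[0].decode()
    env = {}
    for field in fields[1:]:
        if b"=" in field:
            key, _, value = field.partition(b"=")
            env[key.decode()] = value.decode()
    return {"header": header, "env": env}

# A datagram shaped like what the kernel emits over NETLINK_KOBJECT_UEVENT:
msg = (b"add@/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/"
       b"0:0:0:0/block/sda\0"
       b"ACTION=add\0SUBSYSTEM=block\0DEVNAME=sda\0MAJOR=8\0MINOR=0\0")
ev = parse_uevent(msg)
```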
Good point; something I'll keep in mind in the future evolution of
libudev-compat :)

>> It is my understanding (please correct me if I'm wrong) that with
>> s6-ftrig-*, I would need to write out the event data to each
>> listener's pipe (i.e. once per struct udev_monitor instance), and I
>> would still be responsible for cleaning up the fifodir every now
>> and then if the libudev-compat client failed to do so itself.  Is
>> my understanding correct?
>
> Yes and no. I'm not suggesting you use libftrig for your purpose. :)
>
> * My concern with libftrig was never event storage: it was
> many-to-many notification. I didn't design it to transmit arbitrary
> amounts of data, but to instantly wake up processes when something
> happens; data transmission *is* possible, but the original idea is
> to send one byte at a time, for just 256 types of event.
>
> Notification and data transmission are orthogonal concepts. It's
> always possible to store data somewhere and notify processes that
> data is available; then processes can fetch the data. Data
> transmission can be pull, whereas notification has to be push.
> libftrig is only about the push.
>
> Leaking space is not a concern with libftrig, because fifodirs
> never store data, only pipes; at worst, they leak a few inodes.
> That is why a polling cleaner is sufficient: even if multiple
> subscribers get SIGKILLed, they will only leave behind a few
> fifos, and no data - so sweeping now and then is more than enough.
> It's different if you're storing data, because leaks can be much
> more problematic.
>
> * Unless you have true multicast, you will have to push a
> notification as many times as you have listeners, no matter what.
> That's what I'm doing when writing to all the fifos in a fifodir.
> That's what you are doing when linking the event into every
> subscriber's directory. I guess your subscriber library uses some
> kind of inotify to know when a new file has arrived?
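(To make the link-into-every-subscriber's-directory scheme concrete: the event body is written once, hard-linked into each subscriber's queue directory, and the publisher's own link is dropped, so the filesystem keeps a single copy and reclaims it when the last subscriber unlinks. A toy sketch under a made-up directory layout, not vdev's actual one:)

```python
import os
import tempfile

def multicast_event(queue_root: str, payload: bytes, event_name: str) -> None:
    # Write the event body exactly once...
    src = os.path.join(queue_root, event_name)
    with open(src, "wb") as f:
        f.write(payload)
    # ...then hard-link it into every subscriber's queue directory.
    # Each subscriber consumes and unlinks its own link; the data
    # vanishes when the last link is gone.
    subs_dir = os.path.join(queue_root, "subscribers")
    for sub in os.listdir(subs_dir):
        os.link(src, os.path.join(subs_dir, sub, event_name))
    os.unlink(src)  # the publisher drops its own reference

root = tempfile.mkdtemp()
for sub in ("a", "b"):
    os.makedirs(os.path.join(root, "subscribers", sub))
multicast_event(root, b"ACTION=add\nDEVNAME=sda\n", "event-0001")
```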
Yes, modulo some other mechanisms to ensure that the libudev-compat
process doesn't get back-logged and lose messages.  I completely agree
with you about the benefits of separating notification (control-plane)
from message delivery (data-plane).

>> Again, I would love to know of a simpler approach that is just as
>> robust.
>
> Whenever you have "pull" data transmission, you necessarily have the
> problem of storage lifetime. Here, as often, what you want is
> reference counting: when the last handle to the data disappears, the
> data is automatically collected.
> The problem is that your current handle, an inode, is not tied to
> the subscriber's lifetime. You want a type of handle that will die
> with the process.
> File descriptors fit this.
>
> So, an idea would be to do something like:
> - Your event manager listens to a Unix domain socket.
> - Your subscribers connect to that socket.
> - For every event:
>   + the event manager stores the event into an anonymous file (e.g.
>     a file in a tmpfs that is unlinked as soon as it is created)
>     while keeping a reading fd on it
>   + the event manager sends a copy of the reading fd, via
>     fd-passing, to every subscriber. This counts as a notification,
>     since it will wake up subscribers.
>   + the event manager closes its own fd to the file.
>   + subscribers will read the fd when they so choose, and they will
>     close it afterwards. The kernel will also close it when they
>     die, so you won't leak any data.
>
> Of course, at that point, you may as well give up and just push the
> whole event over the Unix socket. It's what udevd does, except it
> uses a netlink multicast group instead of a normal socket (so its
> complexity is independent from the number of subscribers). Honestly,
> given that the number of subscribers will likely be small, and your
> events probably aren't too large either, it's the simplest design -
> it's what I'd go for.
> (I even already have the daemon to do it, as a part of skabus.
> Sending data to subscribers is exactly what a pubsub does.)
>
> But if you estimate that the amount of data is too large and you
> don't want to copy it, then you can just send a fd instead. It's
> still manual broadcast, but it's not in O(event length *
> subscribers), it's in O(subscribers), i.e. the same complexity as
> your "hard link the event file" strategy; and it has the exact
> storage properties that you want.
>
> What do you think?

I think both approaches are good ideas and would work just as well.  I
really like skabus's approach--I'll take a look at using it as an
additional (preferred?) vdev-to-libudev-compat message delivery
mechanism :)  It looks like it offers all the aforementioned benefits
over netlink that I'm looking for.

A question on the implementation--what do you think of having each
subscriber create its own Unix domain socket in a canonical directory,
and having the sender connect as a client to each subscriber?  Since
each subscriber needs its own fd to read and close, the directory of
subscriber sockets automatically gives the sender a list of who to
communicate with and a count of how many fds to create.  It also makes
it easy to detect and clean up a dead subscriber's socket: the sender
can request a struct ucred from a subscriber to get its PID (and then
other details from /proc), and if the process ever exits (which the
sender can detect on Linux using a netlink process monitor, like [1]),
the process that created the socket can be assumed to be dead and the
sender can unlink its socket.  The sender would rely on additional
process-instance-identifying information from /proc (like its
start-time) to avoid PID-reuse races.

Thanks again for all your input!
-Jude

[1] http://bewareofgeek.livejournal.com/2945.html?page=1
---
Unsubscribe: alpine-devel+unsubscribe@lists.alpinelinux.org
Help: alpine-devel+help@lists.alpinelinux.org
---