Subject: Re: [alpine-devel] udev replacement on Alpine Linux
To: alpine-devel@lists.alpinelinux.org
From: Laurent Bercot
Date: Wed, 13 Jan 2016 13:33:24 +0100
Message-ID: <56964414.1000605@skarnet.org>
References: <20150727103737.4f95e523@ncopa-desktop.alpinelinux.org>
 <20150728052436.GC1923@newbook> <20160112153804.GI32545@example.net>
 <56953ABE.5090203@skarnet.org> <56958E22.90806@skarnet.org>
X-Mailinglist: alpine-devel
List-Id: Alpine Development
On 13/01/2016 04:47, Jude Nelson wrote:
> I haven't tried this myself, but it should be doable. Vdev's
> event-propagation mechanism is a small program that constructs a
> uevent string from environment variables passed to it by vdev and
> writes the string to the appropriate place. The vdev daemon isn't
> aware of its existence; it simply executes it like it would for any
> other matching device-event action. Another device manager could
> supply the same program with the right environment variables and
> use it for the same purposes.

Indeed. My question then becomes: what are the differences between
the string passed by the kernel (which is more or less a list of
environment variables, too) and the string constructed by vdev ? In
other words, is vdev itself more than a trivial netlink listener,
and if yes, what does it do ? (I'll just take a pointer to the
documentation if that question is answered somewhere.)
For now I'll take a wild guess and say that vdev analyzes the
MODALIAS or something, according to a conf file, in order to know the
correct fan-out to perform and write the event to the correct
subsystems. Am I close ?

> Tmpfs and devtmpfs are designed for holding ephemeral state
> already, so I'm not sure why the fact that they expose data as
> regular files is a concern?

Two different meanings of "ephemeral". tmpfs and devtmpfs are
supposed to retain their data until the end of the system's
lifetime. An event is much more ephemeral than that: it's supposed
to be consumed instantly - like the event from the kernel is
consumed instantly by the netlink listener.
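(To make "trivial netlink listener" concrete: as far as I know, the
kernel's uevent datagram is just an "action@devpath" header followed
by NUL-separated KEY=VALUE pairs, so the listener's parsing job is a
few lines. A toy sketch of my own - the payload below is invented
for illustration, it is not vdev output:)

```python
def parse_uevent(payload: bytes) -> dict:
    """Split 'action@devpath\\0KEY=VALUE\\0...' into a dict of
    environment variables, the way a minimal uevent listener would
    after reading a datagram from an AF_NETLINK,
    NETLINK_KOBJECT_UEVENT socket."""
    fields = payload.split(b"\0")
    action, _, devpath = fields[0].decode().partition("@")
    env = {"ACTION": action, "DEVPATH": devpath}
    for f in fields[1:]:
        if b"=" in f:
            key, _, value = f.partition(b"=")
            env[key.decode()] = value.decode()
    return env

# Made-up payload in the kernel's wire format:
raw = (b"add@/devices/pci0000:00/usb1\0"
       b"ACTION=add\0"
       b"DEVPATH=/devices/pci0000:00/usb1\0"
       b"SUBSYSTEM=usb\0"
       b"MODALIAS=usb:v1D6Bp0002\0")
print(parse_uevent(raw)["SUBSYSTEM"])   # usb
```

(Everything vdev does beyond that split - matching rules, running
helpers - would be the part I'm asking about.)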
Files, even in a tmpfs, remain alive in the absence of a live
process to hold them; but events have no meaning if no process needs
them, which is the reason for the "event leaking" problem. Ideally,
you need a file type with basically the same lifetime as a process.
Holding event data in a file is perfectly valid as long as you have
a mechanism to reclaim the file as soon as the last reference to it
dies.

> I couldn't think of a simpler way that was also as robust. Unless
> I'm misunderstanding something, wrapping an arbitrary program to
> clean up the files it created would, in the extreme, require
> coming up with a way to do so on SIGKILL. I'd love to know if
> there is a simple way to do this, though.

That's where supervisors come into play: the parent of a process
always knows when it dies, even on SIGKILL. Supervised daemons can
have a cleaner script in place. For the general case, it shouldn't
be hard to have a wrapper that forks an arbitrary program and cleans
up /dev/metadata/whatever/*$childpid* when it dies. The price to pay
is an additional process, but that additional process would be very
small.
You can still have a polling "catch-all cleaner" to collect dead
events in case the supervisor/wrapper also died, but since that
occurrence will be rare, the polling period can be pretty long, so
it's not a problem.

> I went with a specialized filesystem for two reasons, both of
> which were to fulfill libudev's API contract:
> * Efficient, reliable event multicasting. By using hard links as
> described above, the event only needs to be written out once, and
> the OS only needs to store one copy.

That's a good mechanism; you're already fulfilling that contract
with the non-eventfs implementation.

> * Automatic multicast channel cleanup. Eventfs would ensure that
> no matter how a process dies, its multicast state would become
> inaccessible and be reclaimed once it is dead (i.e. a subsequent
> filesystem operation on the orphaned state, no matter how soon
> after the process's exit, will fail).

That's where storing events as files is problematic: files survive
processes. But I still don't think a specific fs is necessary: you
can either ensure files do not survive processes (see the
supervisor/cleaner idea above), or you can use another Unix
mechanism (see below).

> Both of the above are implicitly guaranteed by libudev, since it
> relies on a netlink multicast group shared with the udevd process
> to achieve them.

And honestly, that's not a bad design. If you want to have
multicast, and you happen to have a true multicast IPC mechanism,
might as well use it. It will be hard to be as efficient as that: if
you don't have true multicast, you have to compromise somewhere. I
dare say using a netlink multicast group is lighter than designing a
FUSE filesystem to do the same thing. If you want the same
functionality, why didn't you adopt the same mechanism ?
(It can be made modular. You can have a uevent listener that just
gets the event from the kernel and transmits it to the event
manager; and the chosen event manager multicasts it.)

> It is my understanding (please correct me if I'm wrong) that with
> s6-ftrig-*, I would need to write out the event data to each
> listener's pipe (i.e. once per struct udev_monitor instance), and
> I would still be responsible for cleaning up the fifodir every now
> and then if the libudev-compat client failed to do so itself. Is
> my understanding correct?

Yes and no. I'm not suggesting you use libftrig for your purpose. :)

* My concern with libftrig was never event storage: it was
many-to-many notification. I didn't design it to transmit arbitrary
amounts of data, but to instantly wake up processes when something
happens; data transmission *is* possible, but the original idea is
to send one byte at a time, for just 256 types of event.
Notification and data transmission are orthogonal concepts.
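(A toy sketch of that push-only idea, in case it helps - anonymous
pipes stand in here for the named fifos a real fifodir holds, so the
example is self-contained:)

```python
import os
import select

EVENT_DEVICE_ADDED = b"a"   # hypothetical event byte, one of up to 256

# Two subscribers, each owning the read end of a pipe.
subscribers = [os.pipe() for _ in range(2)]

# Notifier side: "push" means writing the byte once per listener.
for r, w in subscribers:
    os.write(w, EVENT_DEVICE_ADDED)

# Subscriber side: poll/select wakes us the instant the byte
# arrives; the byte only identifies the event type, and any actual
# event data would be fetched elsewhere (that part is pull).
r0 = subscribers[0][0]
ready, _, _ = select.select([r0], [], [], 0)
assert ready, "notification should be pending"
print(os.read(r0, 1))
```

(Note no data is stored anywhere: if a subscriber is SIGKILLed, all
that can leak is its pipe, not event contents.)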
It's always possible to store data somewhere and notify processes
that data is available; then processes can fetch the data. Data
transmission can be pull, whereas notification has to be push;
libftrig is only about the push.
Leaking space is not a concern with libftrig, because fifodirs never
store data, only pipes; at worst, they leak a few inodes. That is
why a polling cleaner is sufficient: even if multiple subscribers
get SIGKILLed, they will only leave behind a few fifos, and no data
- so sweeping now and then is more than enough. It's different if
you're storing data, because leaks can be much more problematic.

* Unless you have true multicast, you will have to push a
notification as many times as you have listeners, no matter what.
That's what I'm doing when writing to all the fifos in a fifodir.
That's what you are doing when linking the event into every
subscriber's directory. I guess your subscriber library uses some
kind of inotify to know when a new file has arrived?

> Again, I would love to know of a simpler approach that is just as
> robust.

Whenever you have "pull" data transmission, you necessarily have the
problem of storage lifetime. Here, as often, what you want is
reference counting: when the last handle to the data disappears, the
data is automatically collected. The problem is that your current
handle, an inode, is not tied to the subscriber's lifetime. You want
a type of handle that dies with the process. File descriptors fit
this description.

So, an idea would be to do something like:
- Your event manager listens to a Unix domain socket.
- Your subscribers connect to that socket.
- For every event:
  + the event manager stores the event into an anonymous file (e.g.
    a file in a tmpfs that is unlinked as soon as it is created)
    while keeping a reading fd on it;
  + the event manager sends a copy of the reading fd, via
    fd-passing, to every subscriber. This counts as a notification,
    since it will wake up subscribers;
  + the event manager closes its own fd to the file;
  + subscribers will read the fd when they so choose, and they will
    close it afterwards. The kernel will also close it when they
    die, so you won't leak any data.

Of course, at that point, you may as well give up and just push the
whole event over the Unix socket. It's what udevd does, except it
uses a netlink multicast group instead of a normal socket (so its
complexity is independent of the number of subscribers). Honestly,
given that the number of subscribers will likely be small, and your
events probably aren't too large either, it's the simplest design -
it's what I'd go for. (I even already have the daemon to do it, as a
part of skabus. Sending data to subscribers is exactly what a pubsub
does.)
But if you estimate that the amount of data is too large and you
don't want to copy it, then you can just send a fd instead. It's
still manual broadcast, but it's not O(event length * subscribers),
it's O(subscribers), i.e. the same complexity as your "hard link the
event file" strategy; and it has the exact storage properties that
you want.

What do you think ?

--
Laurent

---
Unsubscribe: alpine-devel+unsubscribe@lists.alpinelinux.org
Help: alpine-devel+help@lists.alpinelinux.org
---