From: "Laurent Bercot" <ska-devel@skarnet.org>
To: "Natanael Copa" <ncopa@alpinelinux.org>, "Rasmus Thomsen" <oss@cogitri.dev>
Subject: Re: Use of supervise-daemon in Alpine
Cc: "Francesco Colista" <fcolista@alpinelinux.org>, Leonardo <rnalrd@gmail.com>,
 ~alpine/devel@lists.alpinelinux.org, =?utf-8?q?S=c3=b6ren=20Tempel?=
 <soeren@soeren-tempel.net>
Date: Thu, 27 Aug 2020 15:35:34 +0000
Message-Id: <em57aef832-b49c-4f20-b082-b7cd986daede@elzian>
In-Reply-To: <20200827171314.5bca06cf@ncopa-desktop.lan>
References: <dea709f7-94b7-f02c-929a-f7368f05bf6d@gmail.com>
 <3LLUI2KOULSYM.359WA6HATX45B@8pit.net>
 <20200821191507.7857010b@ncopa-macbook.copa.dup.pw>
 <ff2f9139bf743abd7303b89f10fc9549@alpinelinux.org>
 <799e151a9764838b5b0e273da3626e471976edb7.camel@cogitri.dev>
 <20200827171314.5bca06cf@ncopa-desktop.lan>
Reply-To: "Laurent Bercot" <ska-devel@skarnet.org>
User-Agent: eM_Client/8.0.3385.0
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable


>But that would not give sysadmin/user the choice to die on error, which
>I fear will lead to nobody caring if the services are buggy or not. The
>"fix" is to restart the service.

  That's a classic administration mistake, and it is absolutely on the
sysadmin or ops person, not on the supervision infrastructure.

  A supervision system does not exist so that services can restart when
they die and the admin can continue napping because who cares, the
service is up.
  A supervision system exists so that services can restart when they
die so they're still kinda functional in an ever-imperfect world while
the admin actually analyzes the error and finds a real fix for the
service.

  The goal of a supervision system is to maximize uptime. It is not
to enable laziness in fixing bugs. If nobody cares that a service is
buggy, you can lay the full blame on the people who do not care; not on
the supervision system. Not supervising daemons by default puts more
of the burden on competent admins in order to cater to the others,
and madness lies down this path.

  Of course, services should be configured so that if they crash,
appropriate notifications are sent to the admin, so problems will not
be silently ignored. supervise-daemon should have a hook you can use
to take some action depending on the exit code (or signal) of the
daemon.
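
  To make the idea of such a hook concrete, here is a minimal sketch in
Python. It is not supervise-daemon's or s6's actual interface; the names
run_once, describe_exit and on_death are hypothetical and only
illustrate how a supervisor could tell a plain exit code from death by
signal and hand that to an admin-supplied callback. It assumes a POSIX
system, where Python reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Classify a subprocess return code (hypothetical helper).

    On POSIX, Python reports death by signal N as return code -N.
    """
    if returncode < 0:
        return ("signal", signal.Signals(-returncode).name)
    return ("exit", returncode)


def run_once(argv, on_death):
    """Run the daemon in the foreground and call the hook when it dies.

    on_death is the hypothetical admin-supplied hook; it receives
    ("exit", code) or ("signal", name) as two arguments.
    """
    proc = subprocess.run(argv)
    kind, value = describe_exit(proc.returncode)
    on_death(kind, value)
    return kind, value


if __name__ == "__main__":
    # Example hook: print a line to stderr. A real one might send mail
    # or page someone instead.
    def notify(kind, value):
        print(f"daemon died: {kind}={value}", file=sys.stderr)

    run_once([sys.executable, "-c", "import sys; sys.exit(3)"], notify)
```

  The point is only that the decision about what to do with a crash
belongs in a hook the admin controls, not in the restart logic itself.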

  Longruns should be supervised, but if some admin does not want to
supervise a given service, there should be an interface allowing them
to tell the supervisor not to restart the service next time it dies.
supervise-daemon should have such an interface; you shouldn't need to
patch it.
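
  As a rough illustration of what "tell the supervisor not to restart"
can look like, the sketch below uses a flag file that the admin creates
while the service is running. This is an assumption made for the
example, not a description of how supervise-daemon or s6 implement it;
the function name supervise, the flag path and the max_runs safety cap
are all hypothetical.

```python
import os
import subprocess
import time


def supervise(argv, flag_path, max_runs=None, delay=0.0):
    """Restart argv every time it dies, until flag_path exists.

    flag_path is a hypothetical "do not restart" marker: the admin
    creates the file, and the next death of the service is final.
    max_runs is only a safety cap for demonstrations.
    Returns the list of return codes observed.
    """
    codes = []
    while max_runs is None or len(codes) < max_runs:
        # Run the service in the foreground and wait for it to die.
        codes.append(subprocess.run(argv).returncode)
        # The admin asked for no restart: leave the service down.
        if os.path.exists(flag_path):
            break
        # Small pause so a crash-looping service does not spin the CPU.
        time.sleep(delay)
    return codes
```

  With something like this, the default stays "longruns are supervised",
and opting out for one service is a runtime decision, not a patch.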

  *cough* Needless to say, s6 provides all of this. *cough*

--
  Laurent