noexcuses - runs cronjobs until they succeed
usage: noexcuses [options] [--] cmd args...
options:
  -h          - Print the help message then exit
  -V          - Print the version message then exit
  -m          - Print the man page then exit
  -w          - Print the man page in html format then exit
  -r          - Print the man page in nroff format then exit
  -q          - Quiet mode (suppress warnings)
  -v          - Verbose mode (announce each job with -x and -R)
  -d period   - Specify the delay between attempts (default: 1h)
admin options:
  -l          - Print outstanding jobs (mnemonic: list)
  -p          - Run ps on outstanding jobs (mnemonic: processes)
  -k pid|all  - Cancel outstanding jobs (mnemonic: kill)
  -x pid|all  - Tell outstanding jobs to run now (mnemonic: execute)
  -f          - Cancel outstanding jobs that were relocated (mn: forget)
  -M period   - Print cronjobs lost due to downtime (mnemonic: missing)
  -R period   - Run cronjobs lost due to downtime (mnemonic: recover)
  -D          - For -M and -R, print source data for analysis (mn: data)
  -U user     - Change to the given user (root only) (mnemonic: user)
  -H host     - Override the hostname (mnemonic: host)
  -A host     - Adopt the given host's jobs from the mirror (mnemonic: adopt)
Sometimes cronjobs fail to run successfully because a required server (like a database or FTP server) is temporarily unavailable due to power failures, hardware failures, software failures, network outages, choice of operating system, pilot error, and the like.
Typically, this results in someone being forced to examine crontabs and error reports, determine which cronjobs really need to be run, and then run them manually. This happened to me twice in one week. I don't want it to happen again. Cronjobs are meant to be automated and I want them to stay that way.
This is the rationale for noexcuses. It keeps track of cronjobs that have failed and keeps running them until they succeed. All you have to do is look at your crontabs, identify the cronjobs-that-must-succeed-no-matter-what and insert noexcuses before the command.
Then, when cron runs noexcuses, noexcuses will run the given cronjob. If the cronjob fails, noexcuses becomes a daemon that will retry the cronjob regularly until it succeeds. Even if the cron host is rebooted before the cronjob succeeds, noexcuses lets you restart all of the outstanding cronjobs. If you can't wait for the cron host to reboot, its outstanding cronjobs can be relocated to another cron host and be forgotten on the original host when it finally reboots. Also, if the cron host is down for a while, noexcuses can tell you which cronjobs missed out on running while it was down and run them. The initscript noexcuses.init can make all these things happen automatically at boot time.
In other words, noexcuses is a free, lightweight, fine-grained, unobtrusive, high-availability tool for cronjobs. Or rather, it's a high-recoverability tool for cronjobs which can either be incorporated into a highly available system or used in the absence of one.
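As an illustration of inserting noexcuses before a command, a crontab entry might change as shown here. This is only a sketch: the schedule and the /usr/local/bin/nightly-export command are hypothetical, and the only option used is -d, which is described under the options.

```crontab
# Before (shown commented out): a failed run stays failed until someone notices
#0 2 * * * /usr/local/bin/nightly-export
# After: a failed run is retried every 30 minutes until it succeeds
0 2 * * * noexcuses -d 30m /usr/local/bin/nightly-export
```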
-h
Print the help message then exit.
-V
Print the version message then exit.
-m
Print the man page then exit. This is equivalent to executing man noexcuses
but this works even when the man page isn't installed.
-w
Print the man page in html format then exit. This lets you install the man page in html format with a command like:
mkdir -p /usr/share/doc/noexcuses/html && noexcuses -w > /usr/share/doc/noexcuses/html/noexcuses.1.html
-r
Print the man page in nroff format then exit. This lets you install the man page with a command like:
noexcuses -r > /usr/share/man/man1/noexcuses.1
-q
Quiet mode. Suppress warnings. Not recommended. Warnings are serious.
-v
Verbose mode. Announce each job before executing it with the -x and -R
options.
-d
period
Specify the delay between attempts to run the command. The default is 1h
which means 1 hour. The period argument looks like:
[#d][#h][#m][#s]
At least one of the above components must be present. If more than one is
present, they must appear in the order shown above. If the given time period
is invalid, it is ignored and the default period of 1h
is used instead
(and a warning is issued). The actual delay between attempts is always at
least one minute.
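The period rules above can be sketched as a small shell helper. This is not how noexcuses itself parses the argument; the function names period_to_seconds and effective_delay are hypothetical, and the code only mirrors the documented behaviour (components in d, h, m, s order, fallback to 1h with a warning, minimum of one minute).

```shell
# Illustrative helper (not part of noexcuses): convert a period such as
# 1d2h30m into seconds. Prints nothing and returns 1 if the period is invalid.
period_to_seconds() {
    period=$1
    # Components are optional but must appear in d, h, m, s order,
    # and at least one of them must be present.
    [ -n "$period" ] || return 1
    printf '%s\n' "$period" |
        grep -Eq '^([0-9]+d)?([0-9]+h)?([0-9]+m)?([0-9]+s)?$' || return 1

    total=0
    rest=$period
    for unit in d:86400 h:3600 m:60 s:1; do
        letter=${unit%%:*}
        factor=${unit##*:}
        case $rest in
            *"$letter"*)
                count=${rest%%"$letter"*}   # digits before the unit letter
                rest=${rest#*"$letter"}     # whatever follows it
                # Strip leading zeros so that 08 is not read as octal.
                count=$(printf '%s\n' "$count" | sed 's/^0*\([0-9]\)/\1/')
                total=$((total + count * factor))
                ;;
        esac
    done
    echo "$total"
}

# Apply the documented fallbacks: 1h if the period is invalid, and never
# less than one minute.
effective_delay() {
    seconds=$(period_to_seconds "$1") || {
        echo "warning: invalid period '$1', using 1h" >&2
        seconds=3600
    }
    [ "$seconds" -ge 60 ] || seconds=60
    echo "$seconds"
}
```

For example, effective_delay 1h30m prints 5400 and effective_delay 10s prints 60.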
-l
Print outstanding jobs then exit. You will need to run this as the root
user to see the outstanding jobs of other users. This enables you to see the
pid, owner and command of outstanding jobs. A simple way to be notified that
jobs are failing (as if you didn't already know) is to put something like
this in your crontab:
*/5 * * * * noexcuses -l
If it produces any output, cron will send it to you in an email. A remote version is also a good idea.
-p
Run ps on outstanding jobs. This only shows process details for
outstanding jobs for which a noexcuses process is currently running. If
the -p
output doesn't list a process for every outstanding job listed by
the -l
option, then some noexcuses processes are missing and you need
to use the -x
option to start them again.
-k
pid|all
Cancel outstanding jobs. If the option's argument is a pid, then only the
corresponding job is cancelled (i.e. its process is killed and its state
file is removed). If the option's argument is all, then all of your
outstanding jobs will be cancelled. You will need to run this as the root
user to cancel the outstanding jobs of other users. Note that if a pid is
given and the current process with that pid doesn't look like the intended
noexcuses process, it will not be killed as this is a sign that the
process was killed by other means or that the cron host has been rebooted
and the pid now refers to an unrelated process.
-x
pid|all
Tell outstanding jobs to try again now. If the option's argument is a pid,
then only the corresponding job is executed. If the option's argument is all,
then all of your outstanding jobs will be executed. You will need to run this
as the root user to execute the outstanding jobs of other users.
Note that this works even if the noexcuses process referred to by pid is no longer running (i.e. after a reboot). In this case, it reconstructs the original call to noexcuses and invokes it.
-f
Cancel outstanding jobs that were relocated to another cron host. Relocated
jobs are defined to be those that are outstanding but for which there is no
corresponding saved state in the mirror directory. The inference is that
another cron host has completed this cron host's outstanding jobs. This
option cannot be used unless a mirror directory is configured in
/etc/noexcuses.conf. Nor can it be used when the mirror directory is not
mounted: if that is attempted then, for safety, no jobs are cancelled. The
mirror directory is deemed to be mounted if the log file is present.
-M
period
Print cronjobs that were lost due to downtime. If the cron host is down
during a period of time in which cronjobs were scheduled to start, those
cronjobs miss out entirely. This option examines your crontab and compares
it against noexcuses.log
to determine which cronjobs guarded by
noexcuses missed out. You will need to run this as the root
user to
examine all crontabs on the cron host. The period argument specifies how
far back in time to look for missing cronjobs. It has the same form as the
-d
option's period argument described above.
-R
period
Run cronjobs that were lost due to downtime. This option is just like the
-M
option except that instead of printing the lost cronjobs, it executes
them. Note that it also adds fake log entries to the log file to simulate
the job having run at the correct time. This is to prevent multiple uses of
the -R
option from running the same jobs multiple times. When the missing
jobs are run, they will also add an entry to the log file as usual. To be
polite, there is a delay of one second between each job being started.
-D
For -M
and -R
, print source data for analysis. This option can only be
used in conjunction with the -M
or -R
option. It prints the cronjobs
that were scheduled to execute during the specified time period. It also
prints the log file entries that were produced in the specified time period.
Use this option to verify for yourself that the output of the -M
option
is correct before using the -R
option. Never use the -R
option until
you know exactly what it is going to do and you know that this is the
correct thing for it to do. If it isn't correct, you can run the correct
commands manually but at least noexcuses will have made it easy by
presenting them all to you.
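The check-before-recover advice above could be wrapped up along these lines. It is a sketch only: the function name recover_after_review and the confirmation prompt are hypothetical, and the only noexcuses options used are -M, -D and -R as documented here.

```shell
# Sketch of a cautious recovery session. The function name and the prompt
# are illustrative; only the -M, -D and -R options come from this page.
recover_after_review() {
    period=$1
    echo "Jobs that -R $period would run:"
    noexcuses -M "$period"
    echo "Source data behind that list:"
    noexcuses -D -M "$period"
    printf 'Run these jobs now? [y/N] '
    read -r answer
    case $answer in
        y|Y) noexcuses -R "$period" ;;
        *)   echo "Nothing was run." ;;
    esac
}
```

Calling recover_after_review 1d shows what would be recovered for the last day and only proceeds if the answer is y.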
-U
user
Change to the given user. Used in conjunction with other options, this
option allows the root
user to list, cancel and run all of the
outstanding or missing jobs of a particular user rather than all users. This is
intended to be used with the administrative options. It can also be used to
launch a cronjob as another user but it's not a good idea. Doing this breaks
the -M
and -R
options when they are used by the other user or when
they are used by the root
user in conjunction with the -U
option to
specify that user. The reason is that the cronjob will be logged in the
other user's log file but the cronjob itself is in the root
user's
crontab where the other user can't see it. So when the other user uses the
-M
or -R
option to identify or run their missing jobs, any jobs in the
root
user's crontab that are run as that user cannot be examined. If such
jobs are missing, they will not be reported or started. So don't use the
-U
option to run new jobs as another user unless the other user has no
intention of using the -M
or -R
options and the root
user has no
intention of using the -M
or -R
options in conjunction with the -U
option.
-H
host
Override the hostname. This might be useful when multiple cron hosts share
the same physical directory for their log and state files. When this is the
case and the -H option is not used, only the files for the current host are
examined. See the spool
parameter in the FILES section. This is
intended to be used with the administrative options but can also be used in
normal usage to lie about what host a job is to run on should the need to do
so arise.
-A
host
Adopt the given host's jobs from the mirror directory. This will rename all
of the given host's state files in the mirror directory. This is useful when
a primary cron host has gone down and you want to relocate its outstanding
jobs to a secondary cron host. Both cron hosts must share the mirror
directory. After adopting the primary cron host's outstanding jobs, using
the -x
all option will restart them on the secondary cron host. To
start cronjobs that missed out while the primary cron host was down,
identical crontabs must be installed and then the -R
option must be used
together with the -H
option naming the primary cron host.
Example: Host cron1 goes down. Its cronjobs are relocated to host cron2:
On host cron2, adopt host cron1's outstanding jobs:
noexcuses -A cron1
noexcuses -x all
noexcuses -H cron1 -R 1h
On host cron1, after it has rebooted:
noexcuses -f
noexcuses -x all
noexcuses -R 1h
The configuration files for noexcuses are /etc/noexcuses.conf and
~/.noexcusesrc. They can contain shell-style comments (i.e. # to end of
line) and blank lines. They can contain the following parameters:
delay = 1h
spool = /var/noexcuses
mirror = /nfs/fileserver/noexcuses
cronpath = /var/spool/cron/crontabs
pslp = ps -lp
psfp = ps -fp
path = /usr/bin:/bin:
supath = /usr/bin:/bin:
The delay
parameter specifies the default delay between attempts to run a
cronjob that has failed.
The spool
parameter specifies the directory in which noexcuses stores
its state and log files. This directory must be writable by all users of
noexcuses and it must have the sticky bit set (like /tmp
and
/var/tmp
do). This directory must be created in advance by the root
user. If not, the /var/tmp
directory will be used instead. /var/tmp
must exist just in case. I really mean that.
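A minimal sketch of creating such a spool directory follows. The helper name make_spool is hypothetical; mode 1777 is the usual way to get a world-writable directory with the sticky bit set, as on /tmp. For the real /var/noexcuses it would need to be run by the root user.

```shell
# Illustrative helper: create a spool directory that every user can write
# to, with the sticky bit set (mode 1777, the same as /tmp).
make_spool() {
    dir=${1:-/var/noexcuses}
    mkdir -p "$dir" && chmod 1777 "$dir"
}
```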
If you override the spool
parameter in your own ~/.noexcusesrc
file,
then your outstanding cronjobs will not be restarted automatically at boot
time but you can still restart them yourself with the -x
option. You
should also override the mirror
parameter if you override the spool
parameter and the mirror
parameter is supplied in /etc/noexcuses.conf
.
There are also log files in the directory specified by the spool
parameter. These are crucial for the -M and -R options. Don't delete
them after a reboot before you have recovered any lost cronjobs. So using
/var/tmp for the spool is a bad idea.
The mirror
parameter specifies a separate directory (at least on a
physically separate disk but preferably on a physically separate file
server). This is to help you recover when faced with disk failure. If the
log files, state files and crontabs are stored elsewhere, then the jobs can
be restarted on another host without having to wait for the original cron
host to be repaired or replaced.
The cronpath
parameter specifies the directory containing all of the
crontab files.
The pslp
parameter specifies the ps command to use that provides the
``long'' format needed by the -k
and -x
options. The default is ps -lp
.
The psfp
parameter specifies the ps command to use that provides the
``full'' format needed by the -l
option. The default is ps -fp
.
The path
and supath
parameters specify, for normal users and the super
user respectively, the PATH
environment variable to be used when starting
noexcuses processes with the -x
and -R
options. This should match
the PATH
environment variable that cron supplies to your cronjobs.
The configuration file for noexcuses.init is /etc/default/noexcuses.init.
It is a shell script that is executed by noexcuses.init. It can contain the
following variables:
noexcuses # Override the location of the noexcuses command
period    # Default period between attempts to run a failing job (Default 1h)
restart   # Restart all known jobs on boot? (i.e. -x) (Default no)
recover   # Recover missing jobs on boot? (i.e. -R) (Default no)
relocated # Forget jobs that were relocated? (i.e. -f) (Default no)
cancel    # Cancel all known jobs on boot? (i.e. -k) (Default no)
See the RELOCATING CRONJOBS TO ANOTHER CRON HOST section for more
information on the relocated variable.
If noexcuses tries to run its cronjob and fails, it saves its state to
disk and then retries the cronjob every now and then until it succeeds. But
the cron host may be rebooted before it succeeds. When this happens, you can
use the -x
option to restart all of the outstanding jobs.
Additionally, if any cronjobs guarded by noexcuses were scheduled to
start during the time that the cron host was down, the aforementioned saving
of state to disk will not have taken place because cron didn't get a
chance to start noexcuses. When this happens, you can use the -M
and
-R
options to run any cronjobs that were scheduled to start during the
time that the cron host was down.
Both of these actions can be automated at boot time by installing
noexcuses.init as an initscript and configuring it (in
/etc/default/noexcuses.init
) to restart outstanding jobs and recover
missing jobs. By default, noexcuses.init does nothing at boot time. See
the BUGS and RACE CONDITIONS sections for the reason.
If your cron host is going to be down for longer than your cronjobs can afford to wait, you can relocate them to another cron host. The other cron host needs to have the same software installed in the same places and the same crontabs (or readily accessible copies) as the original cron host. Most importantly, the other cron host must have access to the original cron host's spool or mirror directory and it must use a copy of it as its own spool directory.
Given all of that, the original cron host's outstanding jobs can be started
up on the other cron host by using noexcuses with the -H
and -x
options. And if the other cron host has the original cron host's crontabs
installed, then any jobs that missed out on being run by the original cron
host can be recovered on the other cron host with the -H
and -R
options.
When the original cron host comes up again, the -f option can then be
used to make it forget its outstanding jobs that were relocated and
completed successfully on the other cron host. Note that any jobs that had
been relocated but have not yet completed successfully on the other cron
host would also be restarted on the original cron host if the -x option
were used on the original cron host, even after the -f option has been
used. If the -R option has been used on the other cron host, then it
should not be used on the original cron host.
In terms of the init script noexcuses.init and its configuration
/etc/default/noexcuses.init, if you use them at all, you can either turn
on (restart and/or recover) or (relocated and maybe cancel) but not both.
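As a sketch, an /etc/default/noexcuses.init for a host that recovers its own jobs might look like this. The variable names are the ones listed in the FILES section; the value yes is an assumption inferred from the documented "Default no". Only enable any of this if the risks described in the BUGS and RACE CONDITIONS sections are acceptable.

```shell
# /etc/default/noexcuses.init on a host that recovers its own jobs.
# The value "yes" is an assumption based on the documented "Default no".
restart=yes     # restart outstanding jobs at boot (i.e. -x)
recover=yes     # run jobs missed during downtime (i.e. -R)
relocated=no    # leave these two off when restart/recover are on
cancel=no
```

A host whose jobs are relocated elsewhere while it is down would instead turn on relocated (and maybe cancel) and leave restart and recover off.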
Anacron serves a different purpose to noexcuses. Anacron seems to be for home computers that aren't always up and where nobody is going to be upset if their cronjobs don't run when scheduled (otherwise it wouldn't be expected for the host to be down). What it does is to ensure that too much time doesn't pass between invocations of periodic jobs assuming that the host is up occasionally. The success or failure of those jobs is of no more importance to anacron than it is to cron itself.
Noexcuses is for production systems where the cron host is always up (unless there's a catastrophe) and cronjobs require access to other hosts that are always up (unless there's a catastrophe) and there are people depending on cronjobs being run either when scheduled or as soon as possible after any catastrophes are resolved. What it does is to ensure that cronjobs run successfully even in the face of disasters in the machine room.
You might be surprised that noexcuses provides no option to name a
cronjob. Naming is usually desired to ensure uniqueness. In other words, a
name option could ensure that, if an attempt to run a named cronjob fails,
then any subsequently cronned instances of identically named cronjobs would
be suppressed until the outstanding job of the same name finally succeeds.
This might seem desirable but it is inappropriate for the sort of cronjobs
that need to be guarded by noexcuses. These jobs are the ones that must
run no matter what. That's why you would use noexcuses with them. If
suppressing cronjobs were acceptable, then it would not make sense to use
noexcuses at all for that cronjob. In fact, if it were acceptable to run
just any one instance of the cronjob, then just letting the cronjob fail
until the problem is resolved (after which it can run successfully) would be
equally acceptable. In other words, if you think that a name option is
needed for a particular cronjob, then that cronjob probably doesn't need
noexcuses. Consider daemon instead (http://libslack.org/daemon/).
You must make sure that cronjobs that are to run via noexcuses are written appropriately. Cronjobs that perform a single action using a single remote server are probably fine. However, cronjobs can get more complicated than that. A cronjob that requires two remote servers, either or both of which may fail, should be written so that it neither wastes time on subsequent attempts to run nor modifies external data multiple times when doing so would be harmful. For example, a cronjob that updates and queries a remote database, saves the data to disk, and then sends the data to a remote FTP server has two main points of failure to consider: the database and the FTP server. The cronjob should be written so that if only the FTP server fails, the database activity isn't repeated on subsequent attempts to run the cronjob. At best, repeating it is inefficient. At worst, performing the database updates multiple times could be a very bad thing indeed. In other words, all cronjobs should be efficiently idempotent and failsafe. And of course, they must communicate their failure to their caller, noexcuses, by exiting with a non-zero return code.
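The database-then-FTP example can be sketched as a shell job that remembers a completed first stage. Everything named here is hypothetical: STATE_DIR, export_data and upload_data are stand-ins for the real work, and keeping the intermediate file between attempts is just one way to avoid repeating the database stage.

```shell
# Hypothetical two-stage cronjob. STATE_DIR, export_data and upload_data
# are illustrative stand-ins for the real database and FTP work.
STATE_DIR=${STATE_DIR:-/var/tmp/nightly-export}

export_data() {     # stage 1 stand-in: write the query results to file $1
    echo "exported rows" > "$1"
}

upload_data() {     # stage 2 stand-in: succeeds only when UPLOAD_OK is set
    [ -n "${UPLOAD_OK:-}" ]
}

run_job() {
    mkdir -p "$STATE_DIR" || return 1
    data=$STATE_DIR/data.out

    # Stage 1: skip the database work if an earlier attempt completed it.
    if [ ! -f "$data" ]; then
        export_data "$data.tmp" || return 1
        mv "$data.tmp" "$data" || return 1
    fi

    # Stage 2: on failure, keep the data so that a retry resumes here.
    upload_data "$data" || return 1

    # Success: clear the saved data so the next scheduled run starts afresh.
    rm -f "$data"
}
```

A real script would end with run_job || exit 1 so that noexcuses sees the non-zero return code and retries.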
You need to be wary when the cron host is part of a cluster. There are many
different types of cluster and I can't offer specific instructions on making
noexcuses work in your particular cluster environment except to advise
you to know how your cluster works and to know how noexcuses works and to
make sure that they will work together. For safety in a clustered
environment, it's probably best not to use noexcuses.init. Instead, run
noexcuses
manually on one of the hosts in the cluster after a reboot (at
least until you can be sure that you know what you are doing).
Also, be aware that noexcuses doesn't know when you have changed your
crontab or what it looked like before. For the purposes of the -M
and
-R
options, noexcuses assumes that the current crontab has been in
place unchanged for the entire time period specified on the command line.
This would normally be the case. If it is not the case you can get many
false positives. Don't believe them. Even better, don't change your crontabs
before recovering lost jobs after a reboot of the cron host.
The following only applies to the -M
and -R
options.
Doesn't examine atjobs to look for any jobs that should have started while the host was down. So don't use noexcuses in atjobs.
Doesn't examine the (Vixie cron) system crontab (/etc/crontab) to look
for any jobs that should have started while the host was down. So don't use
noexcuses in /etc/crontab. Similarly, /etc/anacrontab is ignored.
There is only partial support for Vixie cron syntax extensions. The user
column in /etc/crontab isn't supported. Mapping % to newline in commands
isn't supported. However, the following extensions are supported:
environment variables; step values (e.g. */5 and 0-23/2); day and month
names (first three letters thereof); and the special strings (i.e. @yearly,
@annually, @monthly, @weekly, @daily, @midnight and @hourly (but not
@reboot)).
Doesn't replicate shell parsing of commands. This matters when comparing crontab entries against the log file entries to identify which scheduled jobs were not logged as having run. If the command in the crontab doesn't exactly match a log file entry, noexcuses will think that the job hasn't run and that it needs to be run. So make sure that your cronjobs are simple commands. Don't use any shell meta-characters except for a single space between each argument. In other words:
cmd arg1 arg2 arg3                              # GOOD (no metacharacters)
cmd arg1 arg2 'arg3'                            # BAD (quotes)
cmd  arg1   arg2 arg3                           # BAD (extra spaces)
cmd arg1 'ar g2' arg3                           # BAD (quotes, spaces)
sh -c "cmd \"a b c\" && echo yes || echo no"    # BAD (everything)
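One way to check a crontab command against this rule before relying on -M and -R is sketched here. The function is_simple_command is hypothetical and is not something noexcuses provides; its list of permitted characters is a deliberately conservative assumption, so it may reject some commands that would in fact be matched correctly.

```shell
# Illustrative check (not part of noexcuses): succeed only if the command
# consists of plain words separated by single spaces. The set of allowed
# characters is a conservative assumption.
is_simple_command() {
    printf '%s\n' "$1" |
        grep -Eq '^[A-Za-z0-9_./:=@+,-]+( [A-Za-z0-9_./:=@+,-]+)*$'
}
```

For example, is_simple_command 'cmd arg1 arg2 arg3' succeeds, while is_simple_command "cmd arg1 arg2 'arg3'" fails. A command that needs quoting or shell operators can be moved into a script of its own so that the crontab entry stays simple.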
I cannot guarantee that the -R
option will detect every job that missed
out on being run nor that it will not detect any jobs that did run. It works
provided that cron starts jobs within 1 minute of the time for which they
are scheduled and provided there's no disk corruption of the log file and
provided that the cron host doesn't crash at a particularly bad time. See
the RACE CONDITIONS section to find out when that is. None of that can be
guaranteed. Always check with the -M
option first. This will print a list
of commands that the -R
option would execute. Use the -D
option as
well to verify the list of missing jobs yourself. If the list is wrong, run
the correct commands manually.
In particular, after a cron host crash, determine if a cronjob could have
been running at the time of the crash and somehow verify whether or not it
completed before noexcuses could log it. If you don't, and just run
noexcuses with the -R
option, you may end up running a cronjob that
did already complete successfully.
Similarly, after a cron host crash, determine if a cronjob was logged as
having run unsuccessfully at the time of the crash but for which there is no
corresponding state file. If you don't, and just run noexcuses with the
-R
or -x
option, you may miss out on running a cronjob that has not
yet been run successfully.
Also, after a cron host crash, determine if a cronjob was logged as having
run unsuccessfully before the cron host crash and determine somehow whether
or not it might have just completed successfully at the time of the cron
host crash, before its state file could be removed. If you don't, and just
run noexcuses with the -x
option, you may end up running a cronjob
that did already complete successfully.
For these reasons, the noexcuses.init initscript shouldn't really be used at all. Recovery after a cron host crash should be done manually unless you know that no harm will be done if any cronjob runs an extra time and you know that you can subsequently identify any cronjob that didn't run and restart those manually. The default behaviour of noexcuses.init is to do nothing at boot time. Resist the temptation to change that unless the risk is acceptable.
The -x
option works by examining the state files (as do the -l
, -p
and -k
options). The -M
and -R
options work by examining the log
file and crontabs.
Log entries are written after the first attempt terminates (whether
successfully or not). The state file is written after the first attempt
terminates unsuccessfully. This ``ensures'' that, in the event of a cron host
crash or reboot, one and only one of the -x
and -R
options will start
a given cronjob. If the cron host crashes before or during the first
attempt, -R
will run it. If the cron host crashes after the first attempt
failed, -x
will run it.
However, there are three race conditions. If the cron host crashes after a
cronjob's first attempt terminates successfully but before the log file
entry has been written, then the -R
option after rebooting will run the
cronjob even though it terminated successfully. If the cron host crashes
after the first attempt terminates unsuccessfully and after the log file
entry has been written but before the state file is written, then neither
the -R
option nor the -x
option after rebooting will run the cronjob
even though it did not terminate successfully. If the cron host crashes
after an initially unsuccessful cronjob finally succeeds but before its
state file is removed, then the -x
option after rebooting will run the
cronjob even though it terminated successfully. The windows of opportunity
here are very small but they are there nonetheless. Disk corruption of the
log and state files will also ruin your day. So make use of the mirror
parameter in the configuration file.
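The rule and its three race conditions can be summarised as a toy decision function. This is a model of the description above, not noexcuses code: the function name recovery_option and its yes/no arguments are hypothetical.

```shell
# Toy model of the rule above, not noexcuses code. Arguments are yes or no:
#   $1 = a log entry exists for the scheduled run
#   $2 = a state file exists for it
recovery_option() {
    logged=$1
    state=$2
    if [ "$state" = yes ]; then
        # The first attempt failed, or (race) the job finally succeeded
        # but its state file had not yet been removed.
        echo "-x"
    elif [ "$logged" = no ]; then
        # The host crashed before or during the first attempt, or (race)
        # the job succeeded but the log entry had not yet been written.
        echo "-R"
    else
        # The job is taken to be done, or (race) it failed and was logged
        # but its state file had not yet been written.
        echo "neither"
    fi
}
```

The model makes the point that the log entry and state file alone cannot tell the normal cases from the races, which is why manual verification after a crash is advised.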
So noexcuses doesn't really ensure anything but it is a vast improvement over cron by itself. Even so, please monitor your uninterruptible power supplies and generators and test them regularly. I really mean that. If that's too much to ask, then sudden power failures are bound to happen and may trigger one of the race conditions.
I wanted to call this program golem because it makes other programs more determined than they would otherwise be to obey the words in their heads. It seems to me to be a singularly appropriate name. However, there are already too many programs out there called golem. And anyway, all software are golem. This program just makes them better golems. Of all the other names I thought of, noexcuses seemed to be the most sensible.
cron(8), crontab(1), crontab(5), anacron(8), init(8), daemon(1)
20080328 raf <raf@raf.org>