NAME

noexcuses - runs important cronjobs until they succeed

SYNOPSIS

 usage: noexcuses [options] [--] cmd args...
 options:

   -h, --help    - Print the help message then exit
   -V, --version - Print the version message then exit
   -m            - Print the man page then exit
   -r            - Print the man page in nroff format then exit
   -w            - Print the man page in HTML format then exit

   -C configpath - Override the default config file: /etc/noexcuses.conf
   -d period     - Override the default delay between attempts: 1h

 admin options:

   -l         - Print outstanding jobs (mnemonic: list)
   -p         - Run ps on outstanding jobs (mnemonic: processes)
   -x pid|all - Tell outstanding jobs to run now (mnemonic: execute)
   -c pid|all - Cancel outstanding jobs (mnemonic: cancel)
   -A host    - Adopt the given host's jobs from the mirror (mn: adopt)
   -F         - Cancel outstanding jobs that were relocated (mn: forget)
   -M period  - Print cronjobs lost due to downtime (mnemonic: missing)
   -R period  - Run cronjobs lost due to downtime (mnemonic: recover)
   -D         - For -M and -R, print source data for analysis (mn: data)
   -U user    - Change to the given user (root only) (mnemonic: user)
   -H host    - Override the hostname (mnemonic: host)
   -v         - Verbose mode (announce each job with -x -c -A -F -R)
   -q         - Quiet mode (suppress warnings)

DESCRIPTION

Sometimes cronjobs fail to run successfully because a required server (like a database or FTP server) is temporarily unavailable due to power failures, hardware failures, software failures, network outages, choice of operating system, pilot error, and the like.

Typically, this results in someone being forced to examine crontabs and error reports, determine which cronjobs really need to be run, and then run them manually. This happened to me twice in one week. I don't want it to happen again. Cronjobs are meant to be automated and I want them to stay that way.

This is the rationale for noexcuses. It keeps track of cronjobs that have failed and keeps running them until they succeed. All you have to do is look at your crontabs, identify the cronjobs-that-must-succeed-no-matter-what and insert noexcuses before the command.

Then, when cron runs noexcuses, noexcuses will run the given cronjob. If the cronjob fails, noexcuses becomes a daemon that will retry the cronjob regularly until it succeeds. Even if the cron host is rebooted before the cronjob succeeds, noexcuses lets you recover (or cancel) all of the outstanding cronjobs. If you can't wait for the cron host to reboot, its outstanding cronjobs can be relocated to another cron host and be forgotten on the original host when it finally reboots.

Also, if the cron host is down for a while, noexcuses can tell you which cronjobs missed out on running while it was down and run them. The initscript noexcuses.init can make all these things happen automatically at boot time.

In other words, noexcuses is a free, lightweight, fine-grained, unobtrusive, high-availability tool for cronjobs. Or rather, it's a high-recoverability tool for cronjobs which can either be incorporated into a highly available system or used in the absence of one.

OPTIONS

-h, --help

Print the help message then exit.

-V, --version

Print the version message then exit.

-C configpath

Override the default config file: /etc/noexcuses.conf or /usr/local/etc/noexcuses.conf. See the FILES section below.

-m

Print the man page then exit. This is equivalent to executing man noexcuses but this works even when the man page isn't installed.

-w

Print the man page in HTML format then exit. This lets you install the man page in HTML format with a command like:

 mkdir -p /usr/share/doc/noexcuses/html &&
 noexcuses -w > /usr/share/doc/noexcuses/html/noexcuses.1.html

-r

Print the man page in nroff format then exit. This lets you install the man page with a command like:

 noexcuses -r > /usr/share/man/man1/noexcuses.1

-q

Quiet mode. Suppress warnings. Not recommended. Warnings are serious.

-v

Verbose mode. Announces each job before executing, cancelling, recovering, adopting or forgetting it with the -x, -c, -R, -A and -F options.

-d period

Specify the delay between attempts to run the command. The default is 1h (i.e. 1 hour). The period argument looks like:

 [#d][#h][#m][#s]

At least one of the above components must be present. If more than one is present, they must appear in the order shown above. The final "s" for seconds is optional. If the given time period is invalid, it is ignored and the default period of 1h is used instead (and a warning is issued).
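
As an illustration of the period format, here is a minimal POSIX shell sketch that converts a period into seconds. It is not taken from noexcuses itself. The function name period_to_seconds is hypothetical, and the code reflects one reading of the rules above: components appear in d, h, m, s order, each at most once, and the trailing "s" may be omitted.

```shell
# Hypothetical helper, not part of noexcuses: convert a period such as
# 1d2h30m, 90s or 90 into seconds. Prints the total on success and
# returns 1 if the period is invalid.
period_to_seconds() {
    rest=$1
    total=0
    last=0
    [ -n "$rest" ] || return 1
    while [ -n "$rest" ]; do
        num=${rest%%[!0-9]*}            # leading run of digits
        [ -n "$num" ] || return 1
        rest=${rest#"$num"}
        unit=${rest%"${rest#?}"}        # next character, empty at the end
        rest=${rest#?}
        case $unit in
            d) rank=1 mult=86400 ;;
            h) rank=2 mult=3600 ;;
            m) rank=3 mult=60 ;;
            s|'') rank=4 mult=1 ;;      # the final "s" is optional
            *) return 1 ;;
        esac
        [ "$rank" -gt "$last" ] || return 1   # enforce d, h, m, s order
        last=$rank
        # Strip leading zeros so the shell doesn't treat the number as octal.
        while [ "${#num}" -gt 1 ] && [ "${num#0}" != "$num" ]; do
            num=${num#0}
        done
        total=$((total + num * mult))
    done
    echo "$total"
}
```

For example, period_to_seconds 1h30m prints 5400, while period_to_seconds 1m1h prints nothing and returns 1 because the components are out of order.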

-l

Print any outstanding jobs then exit. You will need to run this as the root user (and with the -U user option) to see the outstanding jobs of other users. This enables you to see the pid, owner and command of any outstanding jobs. A simple way to be notified that jobs are failing (as if you didn't already know) is to put something like this in your crontab:

 */5 * * * * noexcuses -l

If it produces any output, cron will send it to you in an email. A remote version is also a good idea.
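
One possible form of a remote version is sketched below. It assumes a separate monitoring host with key-based ssh access to the cron host. The hostname cron1 is a placeholder, and the entry only lists the jobs of the user it logs in as.

```
# Hypothetical crontab entry on a monitoring host, not on the cron host.
# cron1 is a placeholder hostname; ssh must work without a password.
*/5 * * * * ssh cron1 noexcuses -l
```

A side effect of this arrangement is that, if the cron host is unreachable, ssh itself prints an error, and cron emails that as well.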

-p

Run ps on any outstanding jobs. You will need to run this as the root user (and with the -U user option) to run ps on the outstanding jobs of other users. This only shows process details for outstanding jobs for which a noexcuses process is currently running. If the -p output doesn't list a process for every outstanding job listed by the -l option, then some noexcuses processes are missing and you need to use the -x option to start them again.

-x pid|all

Tell outstanding jobs to try again now. If the option's argument is a pid, then only the corresponding job is executed. If the option's argument is all, then all of your outstanding jobs will be executed. You will need to run this as the root user (and with the -U user option) to execute the outstanding jobs of other users.

Note that this works even if the noexcuses process referred to by pid is no longer running (e.g. after a reboot). In this case, it reconstructs the original call to noexcuses and invokes it.

-c pid|all

Cancel outstanding jobs. If the option's argument is a pid, then only the corresponding job is cancelled (i.e. its noexcuses process is killed and its state file is removed). If the option's argument is all, then all of your outstanding jobs will be cancelled. You will need to run this as the root user (and with the -U user option) to cancel the outstanding jobs of other users. Note that if a pid is given, and the current process with that pid doesn't look like the intended noexcuses process, it will not be killed, as this is a sign that the process was killed by other means, or that the cron host has been rebooted and the pid now refers to an unrelated process.

-A host

Adopt the given host's jobs from the mirror directory. This will rename all of the given host's state files in the mirror directory. This is useful when a primary cron host has gone down, and you want to relocate its outstanding jobs to a secondary cron host. Both cron hosts must share the mirror directory. After adopting the primary cron host's outstanding jobs, using the -x all option will restart them on the secondary cron host. To start cronjobs that missed out while the primary cron host was down, identical crontabs must be installed (if they are not already present) and then the -R option must be used together with the -H option naming the primary cron host.

Example: Host cron1 goes down. Its cronjobs are relocated to host cron2:

On host cron2, adopt host cron1's outstanding jobs:

 noexcuses -A cron1 -v
 noexcuses -x all -v
 noexcuses -H cron1 -R 1h -v

On host cron1, after it has rebooted:

 noexcuses -F -v
 noexcuses -x all -v
 noexcuses -R 1h -v

-F

Cancel outstanding jobs that were relocated to another cron host while the current cron host was down. Relocated jobs are defined to be those that are outstanding, but for which there is no corresponding saved state in the mirror directory. The inference is that another cron host has adopted this cron host's outstanding jobs. This option cannot be used unless a mirror directory is configured in /etc/noexcuses.conf (or /usr/local/etc/noexcuses.conf). This option can't be used when the mirror directory is not mounted. If it is used in that situation then, for safety, no jobs are cancelled. The mirror directory is deemed to be mounted if the log file is present.

-M period

Print cronjobs that were lost due to downtime. If the cron host is down during a period of time in which cronjobs were scheduled to start, those cronjobs miss out entirely. This option examines your crontab and compares it against the logfile noexcuses.<hostname>.<username>.log to determine which cronjobs guarded by noexcuses missed out.

You will need to run this as the root user to examine all crontabs on the cron host. The period argument specifies how far back in time to look for missing cronjobs. It has the same form as the -d option's period argument described above. If the given time period is invalid, it is ignored and the default period of 12h is used instead (and a warning is issued).

-R period

Run cronjobs that were lost due to downtime. This option is just like the -M option except that instead of printing the lost cronjobs, it executes them. Note that it also adds fake log entries to the log file to simulate the job having run at the correct time. This is to prevent multiple uses of the -R option from running the same jobs multiple times. When the missing jobs are run, they will also add an entry to the log file as usual.

You will need to run this as the root user to examine all crontabs on the cron host. The period argument specifies how far back in time to look for missing cronjobs. It has the same form as the -d option's period argument described above. If the given time period is invalid, it is ignored and the default period of 12h is used instead (and a warning is issued).

-D

For -M and -R, print source data for analysis. This option can only be used in conjunction with the -M or -R option. It prints the cronjobs that were scheduled to execute during the specified time period. It also prints the log file entries that were produced in the specified time period. Use this option to verify for yourself that the output of the -M option is correct before using the -R option. Never use the -R option until you know exactly what it is going to do and you know that this is the correct thing for it to do. If it isn't correct, you can run the correct commands manually but at least noexcuses will have made it easy by presenting them all to you.

-U user

Change to the given user. Used in conjunction with other options, this option allows the root user to list, cancel and run all of the outstanding or missing jobs of a particular user rather than all users. This is intended to be used with the administrative options. It can also be used to launch a cronjob as another user but it's not a good idea. Doing this breaks the -M and -R options when they are used by the other user or when they are used by the root user in conjunction with the -U option to specify that user. The reason is that the cronjob will be logged in the other user's log file but the cronjob itself is in the root user's crontab where the other user can't see it. So when the other user uses the -M or -R option to identify or run their missing jobs, any jobs in the root user's crontab that are run as that user cannot be examined. If such jobs are missing, they will not be reported or started. So don't use the -U option to run new jobs as another user unless the other user has no intention of using the -M or -R options and the root user has no intention of using the -M or -R options in conjunction with the -U option.

-H host

Override the hostname. This might be useful when multiple cron hosts share the same physical directory for their log and state files. When this is the case and the -H option is not used, only the files for the current host are examined. See the spool parameter in the FILES section. This is intended to be used with the administrative options but can also be used in normal usage to lie about what host a job is to run on should the need to do so arise.

EXAMPLES

noexcuses(1) is prepended to the real command in crontabs:

 # m h  dom mon dow   command
 0 1 * * * noexcuses cmd arg...
 0 2 * * * noexcuses -d 4h cmd arg...
 0 3 * * * noexcuses -C /home/me/noexcuses.conf cmd arg...

Command names and arguments cannot contain spaces or any other shell meta-characters.

FILES

The default configuration files for noexcuses are /etc/noexcuses.conf (or /usr/local/etc/noexcuses.conf), and ~/.noexcusesrc. This can be overridden with the -C configpath option. They can contain shell-style comments (i.e. # to end of the line) and blank lines. They can contain the following parameters:

 delay = 1h
 spool = /var/noexcuses
 mirror = /nfs/fileserver/noexcuses
 cronpath = /var/spool/cron/crontabs
 pslong = ps -fp
 psfull = ps -fp
 path = /usr/bin:/bin:
 supath = /usr/bin:/bin:

The delay parameter specifies the default delay between attempts to run a cronjob that has failed.

The spool parameter specifies the directory in which noexcuses stores its state and log files. This directory must be writable by all users of noexcuses and it must have the sticky bit set (like /tmp and /var/tmp do). This directory must be created in advance by the root user. If not, the /var/tmp directory will be used instead. /var/tmp must exist just in case. I really mean that.

If you override the spool parameter in your own ~/.noexcusesrc file, then your outstanding cronjobs will not be restarted automatically at boot time but you can still restart them yourself with the -x option. You should also override the mirror parameter if you override the spool parameter and the mirror parameter is supplied in /etc/noexcuses.conf.
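
As a sketch of the advice above, a personal configuration file that overrides both parameters together might look like the following. Both paths are hypothetical and the directories would need to exist already.

```
# ~/.noexcusesrc (both paths are hypothetical)
spool = /home/me/noexcuses/spool
mirror = /nfs/fileserver/noexcuses-me
```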

There are also log files in the directory specified by the spool parameter. These are crucial for the -M and -R options. Don't delete them after a reboot before you have recovered any lost cronjobs. So using /var/tmp for the spool is a bad idea.

The mirror parameter specifies a separate directory (at least on a physically separate disk but preferably on a physically separate file server). This is to help you recover when faced with disk failure. If the log files, state files and crontabs are stored elsewhere, then the jobs can be restarted on another host without having to wait for the original cron host to be repaired or replaced.

The cronpath parameter specifies the directory containing all of the crontab files.

The pslong parameter specifies the ps command to use that provides the "long" format needed by the -c and -x options. The default is ps -fp.

The psfull parameter specifies the ps command to use that provides the "full" format needed by the -l option. The default is ps -fp.

The path and supath parameters specify, for normal users and the super user respectively, the PATH environment variable to be used when starting noexcuses processes with the -x and -R options. This should match the PATH environment variable that cron supplies to your cronjobs.

The configuration file for /etc/init.d/noexcuses is /etc/default/noexcuses. It is a shell script that is sourced by /etc/init.d/noexcuses. It can contain the following variables:

 noexcuses # Override the location of the noexcuses command
 period    # Default period between attempts to run a failing job (Default 1h)
 relocated # Forget jobs that were relocated? (i.e. -F) (Default no)
 cancel    # Cancel all known jobs on boot? (i.e. -c all) (Default no)
 restart   # Restart all known jobs on boot? (i.e. -x all) (Default no)
 recover   # Recover missing jobs on boot? (i.e. -R) (Default no)

See the RELOCATING CRONJOBS TO ANOTHER CRON HOST section below for more information on the relocated variable.

PERSISTENCE ACROSS REBOOTS

If noexcuses tries to run its cronjob and fails, it saves its state to disk and then retries the cronjob every now and then until it succeeds. But the cron host might be rebooted before it succeeds. When this happens, you can use the -x option to restart all of the outstanding jobs.

Additionally, if any cronjobs guarded by noexcuses were scheduled to start during the time that the cron host was down, the aforementioned saving of state to disk will not have taken place because cron didn't get a chance to start noexcuses. When this happens, you can use the -M and -R options to run any cronjobs that were scheduled to start during the time that the cron host was down.

Both of these actions can be automated at boot time by installing /etc/init.d/noexcuses as an initscript and configuring it (in /etc/default/noexcuses) to restart outstanding jobs and recover missing jobs. By default, /etc/init.d/noexcuses does nothing at boot time. See the BUGS and RACE CONDITIONS sections below for the reason.

RELOCATING CRONJOBS TO ANOTHER CRON HOST

If your cron host is going to be down for longer than your cronjobs can afford to wait, you can relocate them to another cron host. The other cron host needs to have the same software installed in the same places and the same crontabs (or readily accessible copies) as the original cron host. Most importantly, the other cron host must have access to the original cron host's spool or mirror directory and it must use a copy of it as its own spool directory.

Given all of that, the original cron host's outstanding jobs can be started up on the other cron host by using noexcuses with the -H and -x options. And if the other cron host has the original cron host's crontabs installed, then any jobs that missed out on being run by the original cron host can be recovered on the other cron host with the -H and -R options.

When the original cron host comes up again, the -F option can then be used to make it forget its outstanding jobs that were relocated and completed successfully on the other cron host. Note that if there were jobs that had been relocated but that have not yet completed successfully on the other cron host, they would also be restarted on the original cron host if the -x option were used on the original cron host, even after the -F option has been used. If the -R option has been used on the other cron host, then it should not be used on the original cron host.

In terms of the initscript /etc/init.d/noexcuses and its configuration /etc/default/noexcuses, if you use them at all, you can either turn on (restart and/or recover) or (relocated and maybe cancel) but not both.
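
The two mutually exclusive setups might be written as follows in /etc/default/noexcuses. This is only a sketch: the variable names come from the FILES section, but the value "yes" is an assumption based on the documented default of "no". Read the BUGS section before enabling any of them.

```
# Either: this host restarts and recovers its own jobs at boot.
restart=yes
recover=yes

# Or (instead of the above): this host's jobs were relocated while it
# was down, so it forgets them at boot.
#relocated=yes
```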

IS THIS LIKE ANACRON?

Anacron serves a different purpose to noexcuses. Anacron is for home computers that aren't always up and where nobody is going to be upset if their cronjobs don't run when scheduled (otherwise it wouldn't be expected for the host to be down). What it does is to ensure that too much time doesn't pass between invocations of periodic jobs assuming that the host is up occasionally. The success or failure of those jobs is of no more importance to anacron than it is to cron itself.

Noexcuses is for production systems where the cron host is always up (unless there's a catastrophe) and cronjobs require access to other hosts that are always up (unless there's a catastrophe) and there are people depending on cronjobs being run either when scheduled or as soon as possible after any catastrophes are resolved. What it does is to ensure that cronjobs run successfully even in the face of disasters in the machine room.

WHY IS THERE NO NAME OPTION?

You might be surprised that noexcuses provides no option to name a cronjob. Naming is usually desired to ensure uniqueness. In other words, a name option could ensure that, if an attempt to run a named cronjob fails, then any subsequently cronned instances of identically named cronjobs would be suppressed until the outstanding job of the same name finally succeeds. This might seem desirable but it is inappropriate for the sort of cronjobs that need to be guarded by noexcuses. These jobs are the ones that must run no matter what. That's why you would use noexcuses with them. If suppressing cronjobs were acceptable, then it would not make sense to use noexcuses at all for that cronjob. In fact, if it were acceptable to run just any one instance of the cronjob, then just letting the cronjob fail until the problem is resolved (after which it can run successfully) would be equally acceptable. In other words, if you think that a name option is needed for a particular cronjob, then that cronjob probably doesn't need noexcuses. Consider daemon instead which does provide the ability to name a task and ensure that there is only a single instance of it running at any one time (http://libslack.org/daemon/).

CAVEAT

You must make sure that cronjobs that are to run via noexcuses are written appropriately. Cronjobs that perform a single action using a single remote server are probably fine. However, cronjobs can get more complicated than that. A cronjob that requires two remote servers, either or both of which might fail, should be written so as to not waste time on subsequent attempts to run, nor should it modify external data multiple times if that would not be a good thing to do. For example, a cronjob that updates and queries a remote database, saves the data to disk, and then sends the data to a remote FTP server has two main points of failure to consider: the database and the FTP server. The cronjob should be written so that if only the FTP server fails, the database activity isn't repeated on subsequent attempts to run the cronjob. At best, it's inefficient. At worst, performing the database updates multiple times could be a very bad thing indeed. In other words, all cronjobs should be efficiently idempotent and failsafe. And of course, they must communicate their failure to their caller, noexcuses, by exiting with a non-zero exit status.
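
The database-then-FTP example above can be sketched in POSIX shell as follows. This is one possible structure, not a prescribed one. The names run_staged_job, fetch_from_database and upload_to_ftp are hypothetical, and the two stage functions are trivial placeholders for the real commands.

```shell
# Hypothetical cronjob body illustrating the advice above. The two stage
# functions are placeholders; replace them with the real database and FTP
# commands. Nothing here is part of noexcuses.
fetch_from_database() { echo "example data"; }
upload_to_ftp() { [ -s "$1" ]; }

# Run both stages, keeping the exported data between failed attempts so
# that a retry after an upload failure doesn't repeat the database work.
run_staged_job() {
    workdir=$1
    data=$workdir/export.dat
    mkdir -p "$workdir" || return 1

    # Stage 1: skip if an earlier attempt already produced the data.
    if [ ! -f "$data" ]; then
        if ! fetch_from_database > "$data.tmp"; then
            rm -f "$data.tmp"
            return 1
        fi
        mv "$data.tmp" "$data" || return 1
    fi

    # Stage 2: on failure, leave the data in place for the next attempt.
    upload_to_ftp "$data" || return 1

    # Success: remove the data so the next scheduled run starts afresh.
    rm -f "$data"
}
```

A script that calls run_staged_job and exits with its status reports failure to noexcuses through the non-zero exit status. Because noexcuses doesn't suppress later scheduled instances of the same cronjob, a real job would need a work directory that is unique to each scheduled run.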

You need to be wary when the cron host is part of a cluster. There are many different types of cluster and I can't offer specific instructions on making noexcuses work in your particular cluster environment except to advise you to know how your cluster works and to know how noexcuses works and to make sure that they will work together. For safety in a clustered environment, it's probably best not to use /etc/init.d/noexcuses. Instead, run noexcuses manually on one of the hosts in the cluster after a reboot (at least until you can be sure that you know what you are doing).

Also, be aware that noexcuses doesn't know when you have changed your crontab or what it looked like before. For the purposes of the -M and -R options, noexcuses assumes that the current crontab has been in place unchanged for the entire time period specified on the command line. This would normally be the case. If it is not the case you can get many false positives. Don't believe them. Even better, don't change your crontabs before recovering lost jobs after a reboot of the cron host.

BUGS

The following only applies to the -M and -R options.

Doesn't examine atjobs to look for any jobs that should have started while the host was down. So don't use noexcuses in atjobs.

Doesn't examine the (Vixie cron) system crontab (/etc/crontab) to look for any jobs that should have started while the host was down. So don't use noexcuses in /etc/crontab. Similarly, /etc/anacrontab is ignored.

There is only partial support for Vixie cron syntax extensions. The user column in /etc/crontab isn't supported. Mapping % to newline in commands isn't supported. However, the following extensions are supported: environment variables, step values (e.g. */5 and 0-23/2); day and month names (first three letters thereof); and the special strings (i.e. @yearly, @annually, @monthly, @weekly, @daily, @midnight and @hourly (but not @reboot)).

Doesn't replicate shell parsing of commands. This matters when comparing crontab entries against the log file entries to identify which scheduled jobs were not logged as having run. If the command in the crontab doesn't exactly match a log file entry, noexcuses will think that the job hasn't run and that it needs to be run. So make sure that your cronjobs are simple commands. Don't use any shell meta-characters except for a single space between each argument. In other words:

 cmd arg1 arg2 arg3 # GOOD (no metacharacters)
 cmd arg1 arg2 'arg3' # BAD (quotes)
 cmd arg1  arg2  arg3 # BAD (extra spaces)
 cmd  arg1 'ar g2' arg3 # BAD (quotes, spaces)
 sh -c "cmd \"a b c\" && echo yes || echo no" # BAD (everything)
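
The rule above can be checked mechanically before a command goes into a crontab. The following POSIX shell sketch is a hypothetical lint helper, not part of noexcuses. The function name is_simple_command and the set of characters it treats as safe are assumptions.

```shell
# Hypothetical lint helper, not part of noexcuses: succeed only if a
# crontab command consists of words separated by single spaces. The set
# of characters treated as safe is an assumption; adjust it as needed.
is_simple_command() {
    case $1 in
        ''|' '*|*' '|*'  '*) return 1 ;;   # empty, edge or doubled spaces
    esac
    # Anything left after deleting the safe characters is a metacharacter.
    leftover=$(printf '%s' "$1" | tr -d 'A-Za-z0-9 _./:=@+,-')
    [ -z "$leftover" ]
}
```

With this helper, the GOOD example above passes and each of the BAD examples fails. Anything that fails can be moved into a small script so that the crontab entry itself stays simple.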

I cannot guarantee that the -R option will detect every job that missed out on being run nor that it will not detect any jobs that did run. It works provided that cron starts jobs within 1 minute of the time for which they are scheduled and provided there's no disk corruption of the log file and provided that the cron host doesn't crash at a particularly bad time. See the RACE CONDITIONS section below to find out when that is. None of that can be guaranteed. Always check with the -M option first. This will print a list of commands that the -R option would execute. Use the -D option as well to verify the list of missing jobs yourself. If the list is wrong, run the correct commands manually.

In particular, after a cron host crash, determine if a cronjob could have been running at the time of the crash and somehow verify whether or not it completed before noexcuses could log it. If you don't, and just run noexcuses with the -R option, you might end up running a cronjob that did already complete successfully.

Similarly, after a cron host crash, determine if a cronjob was logged as having run unsuccessfully at the time of the crash but for which there is no corresponding state file. If you don't, and just run noexcuses with the -R or -x option, you might miss out on running a cronjob that has not yet been run successfully.

Also, after a cron host crash, determine if a cronjob was logged as having run unsuccessfully before the cron host crash and determine somehow whether or not it might have just completed successfully at the time of the cron host crash, before its state file could be removed. If you don't, and just run noexcuses with the -x option, you might end up running a cronjob that did already complete successfully.

For these reasons, the /etc/init.d/noexcuses initscript shouldn't really be used at all. Recovery after a cron host crash should be done manually unless you know that no harm will be done if any cronjob runs an extra time and you know that you can subsequently identify any cronjob that didn't run and restart those manually. The default behaviour of /etc/init.d/noexcuses is to do nothing at boot time. Resist the temptation to change that unless the risk is acceptable.

RACE CONDITIONS

The -x option works by examining the state files (as do the -l, -p and -c options). The -M and -R options work by examining the log file and crontabs.

Log entries are written after the first attempt terminates (whether successfully or not). The state file is written after the first attempt terminates unsuccessfully. This "ensures" that, in the event of a cron host crash or reboot, one and only one of the -x and -R options will start a given cronjob. If the cron host crashes before or during the first attempt, -R will run it. If the cron host crashes after the first attempt failed, -x will run it.

However, there are three race conditions. If the cron host crashes after a cronjob's first attempt terminates successfully but before the log file entry has been written, then the -R option after rebooting will run the cronjob even though it terminated successfully. If the cron host crashes after the first attempt terminates unsuccessfully and after the log file entry has been written but before the state file is written, then neither the -R option nor the -x option after rebooting will run the cronjob even though it did not terminate successfully. If the cron host crashes after an initially unsuccessful cronjob finally succeeds but before its state file is removed, then the -x option after rebooting will run the cronjob even though it terminated successfully. The windows of opportunity here are very small but they are there nonetheless. Disk corruption of the log and state files will also ruin your day. So make use of the mirror parameter in the configuration file.
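
The rules above amount to a small decision table, sketched here in POSIX shell as an illustration only. The function name recovery_option is hypothetical. It takes two arguments, each "yes" or "no": whether the job has a log entry, and whether it has a state file.

```shell
# Illustration only, not part of noexcuses. Given whether a job has a
# log entry and a state file after a crash ("yes" or "no" each), print
# which recovery option would start it according to the rules above.
recovery_option() {
    case "$1,$2" in
        no,no)   echo "-R" ;;        # never logged: treated as missing
        yes,yes) echo "-x" ;;        # logged and still outstanding
        yes,no)  echo "neither" ;;   # logged, no state: presumed done
        *)       echo "unexpected"; return 1 ;;   # state without a log entry
    esac
}
```

Each race condition leaves the files in the same state as one of the legitimate cases. A job that succeeded but wasn't logged looks like a missing job, a failed job whose state file wasn't written looks like a completed one, and a job that finally succeeded but whose state file wasn't removed looks like an outstanding one. That is why the files alone can't settle these cases and manual checking is advised.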

So noexcuses doesn't really ensure anything but it is a vast improvement over cron by itself. Even so, please monitor your uninterruptible power supplies and generators and test them regularly. I really mean that. If that's too much to ask, then sudden power failures are bound to happen and might trigger one of the race conditions.

GOLEM

I wanted to call this program golem because it makes other programs more determined than they would otherwise be to obey the words in their heads. It seems to me to be a singularly appropriate name. However, there are already too many programs out there called golem. And anyway, all software are golems. This program just makes them better golems. Of all the other names I thought of, noexcuses seemed to be the most sensible.

SEE ALSO

cron(8), crontab(1), crontab(5), anacron(8), init(8), daemon(1)

AUTHOR

20200625 raf <raf@raf.org>

URL

http://raf.org/noexcuses/, https://github.com/raforg/noexcuses/