NAME

textmail - mail filter to replace MS Word/HTML attachments with plain text

SYNOPSIS

 usage: textmail [options]
 options:

   --help    - Print the help message then exit
   --version - Print the version message then exit
   -h        - Print the help message then exit
   -m        - Print the manpage then exit
   -w        - Print the manpage in HTML format then exit
   -r        - Print the manpage in nroff format then exit
   -M        - Output in mailbox format (mboxrd)
   -T        - Output in raw mail format (for SMTP)
   -W        - Don't replace MS Word attachments with text
   -E        - Don't replace MS Excel attachments with csv
   -H        - Don't replace HTML attachments with text
   -R        - Don't replace RTF attachments with text
   -P        - Don't replace PDF attachments with text
   -U        - Don't translate winmail.dat attachments
   -L        - Don't reduce appledouble attachments
   -I        - Don't delete image attachments
   -A        - Don't delete audio attachments
   -V        - Don't delete video attachments
   -X        - Don't delete MS Windows executable attachments
   -B        - Don't recode text that is base64-encoded
   -S        - Don't replace spaces in filenames with underscores
   -Z        - Do translate signed content (discards signatures)
   -O        - Delete all application/octet-stream attachments
   -!        - Delete all application/* attachments
   -D hdrs   - Delete headers (list of header prefixes and filenames)
   -K types  - Keep attachments (list of mimetypes/exts and filenames)
   -F types  - Save attachments (list of mimetypes/exts and filenames)
   -G path   - Directory to save attachments in (for use with -F)
   -C spec   - Custom attachment translations (mimetype_or_ext:ext:cmd)
   -Y        - Choose plain text alternatives over translated HTML
   -Q spec   - Custom patterns to identify vestigial text alternatives
   -f        - On translation error, keep translation, not original
   -?        - Print paths of helper applications then exit

DESCRIPTION

textmail filters a mail message or mbox file, replacing MS Word, MS Excel, HTML, RTF and PDF attachments with the plain text contained therein. By default, the following attachments are also deleted: image, audio, video, and MS Windows executables. MS winmail.dat attachments are replaced by any attachments contained therein which are then replaced by text or deleted in the same fashion. Any of these actions can be suppressed with command line options. Mail headers can also be selectively deleted. Attachments can also be extracted and saved to disk.

This is useful for increasing the accessibility of mail messages (by reducing their dependence on proprietary file formats), for dramatically reducing their size (and the time it takes to download them and the time it takes to read them), and for dramatically reducing the risk of mail-borne viruses. Its intended use is to reduce the size of a personal email archive but it could also be useful as a pre-processor for mailing lists. This is more friendly than a strict "No Attachments" or "No HTML" mailing list policy.

OPTIONS

--help

Print the help message then exit.

--version

Print the version message then exit.

-h

Print the help message then exit.

-m

Print the manpage then exit. This is equivalent to executing man textmail but this works even when the manpage isn't installed.

-w

Print the manpage in HTML format then exit. This lets you install the manpage in HTML format with a command like:

  mkdir -p /usr/local/share/doc/textmail/html &&
  textmail -w > /usr/local/share/doc/textmail/html/textmail.1.html

-r

Print the manpage in nroff format then exit. This lets you install the manpage with a command like:

  textmail -r > /usr/local/share/man/man1/textmail.1

-M

This option causes the output to be in mboxrd format by adding a mailbox From line at the top if there isn't one already and ensures that there is a blank line at the bottom of the output. It also performs mailbox quoting on any lines in the body that look like mailbox From headers. Use this when the output is to be stored directly in a mailbox file. It is not necessary when textmail is being used as a mail filter by procmail(1).
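
The mailbox quoting step can be illustrated with a short shell sketch. This is not textmail's own code. It assumes the usual mboxrd rule, in which any body line made up of zero or more > characters followed by "From " gains one more leading >. The function name mboxrd_quote is hypothetical.

```shell
# Sketch of mboxrd quoting (not textmail's implementation).
# Reads a message body on stdin and writes the quoted body to stdout.
mboxrd_quote() {
    # Prepend one ">" to every line that starts with zero or more ">"
    # characters followed by "From " (the assumed mboxrd rule).
    sed 's/^>*From />&/'
}
```

The sketch only covers body lines. The mailbox From line at the top of each message must not be passed through it.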

-T

This option causes the output to be in raw mail format by removing any mailbox From line and by not performing mailbox quoting. Use this when the output is to be sent directly to an SMTP server. It is not necessary when textmail is being used as a mail filter by procmail(1).

-W

By default, textmail replaces MS Word attachments with inline plain text attachments that contain just the plain text within the original document. This option leaves MS Word attachments intact.

-E

By default, textmail replaces MS Excel attachments with CSV file attachments that contain just the data within the original document. This option leaves MS Excel attachments intact.

-H

By default, textmail replaces HTML attachments with inline plain text attachments that contain just the plain text within the original HTML attachment. It also replaces text-versus-HTML alternative attachments with the HTML alternative translated to plain text. This option leaves HTML (and alternative) attachments intact.

-R

By default, textmail replaces RTF attachments with inline plain text attachments that contain just the plain text within the original document. This option leaves RTF attachments intact.

-P

By default, textmail replaces PDF attachments with inline plain text attachments that contain just the plain text within the original document. This option leaves PDF attachments intact.

-U

By default, textmail replaces MS TNEF (i.e. winmail.dat) attachments with the attachments contained therein, which are then translated to text as normal. This option leaves winmail.dat attachments intact. This option, together with the -! option, will cause winmail.dat attachments to be deleted rather than translated.

-L

By default, textmail replaces multipart/appledouble attachments with just the data fork attachment contained therein, which is then translated to text as normal. This option leaves appledouble attachments intact. However, the data fork attachment will still be translated as normal, which probably leaves the accompanying resource fork attachment inappropriate and possibly broken. Therefore, this option should probably only be used together with other options that suppress the translation of the data fork attachment.

-I

By default, textmail deletes image attachments. This option leaves image attachments intact.

-A

By default, textmail deletes audio attachments. This option leaves audio attachments intact.

-V

By default, textmail deletes video attachments. This option leaves video attachments intact.

-X

By default, textmail deletes attachments containing MS Windows executables. That means attachments with the following filename extensions: com, exe, pif, dll, ocx, scr, vbs, js, bat and ps1. This option leaves MS Windows executable attachments intact. To delete zip files as well, you could use either the -O option or the -! option.
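
The extension test described above might look like the following shell sketch. It is an illustration only, not textmail's code. The function name is_windows_executable is hypothetical, and matching extensions without regard to case is an assumption.

```shell
# Sketch: succeed if FILENAME ends in one of the extensions listed above.
# Assumption: the comparison ignores case (e.g. "SETUP.EXE" matches).
is_windows_executable() {
    case "$1" in
        *.*) ;;              # there is an extension to examine
        *)   return 1 ;;     # no dot, so no extension
    esac
    # Take the text after the last dot and lower-case it.
    ext=$(printf '%s' "${1##*.}" | tr 'A-Z' 'a-z')
    case "$ext" in
        com|exe|pif|dll|ocx|scr|vbs|js|bat|ps1) return 0 ;;
        *) return 1 ;;
    esac
}
```

A zip file does not match this list, which is why the -O or -! option is needed to delete it.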

-B

By default, when text is encountered that is base64-encoded, textmail will recode it as either 7bit or quoted-printable, whichever is appropriate. This option suppresses this recoding. Note that if the text is large enough and contains a high enough proportion of non-ASCII characters, it will remain base64-encoded to minimise space.

-S

When translating attachments, textmail replaces any bad filename characters such as space characters with the underscore character. This option causes underscore characters to subsequently be converted into space characters. In other words, you can use this option to preserve space characters in attachment filenames (other bad filename characters will then be converted to spaces as well).

-Z

By default, textmail will not translate multipart/signed attachments. This option causes multipart/signed attachments to be replaced by the signed attachment contained therein, discarding the signature control data. The no-longer-signed data is then translated to text as normal. Note that multipart/encrypted attachments are never translated.

-O

Delete all application/octet-stream attachments, not just MS Windows executables. Note that this overrides -X but -K overrides this.

-!

Delete all application/* attachments. Note that this overrides -X but -K overrides this. Also note that translated documents are no longer application/* attachments so they aren't deleted unless their translation is suppressed with the appropriate command line option.

-D hdrs

Delete selected headers. The hdrs argument is a comma-separated list of header name prefixes and/or the names of files containing header name prefixes (blank lines, leading or trailing whitespace, and shell-style comments are ignored). For example, textmail -DX- deletes all headers whose names begin with X-.
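
The rules for list files can be sketched in shell as follows. This is not textmail's own parser. The function name list_entries is hypothetical, and the sketch assumes that a shell-style comment starts at any # character and runs to the end of the line.

```shell
# Sketch of the list-file rules described above (not textmail's parser).
# Prints the usable entries of FILE, one per line.
list_entries() {
    # 1. strip comments, 2. strip leading whitespace,
    # 3. strip trailing whitespace, 4. drop lines that are now empty.
    sed -e 's/#.*$//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' "$1"
}
```

A file prepared this way can be named directly in the hdrs argument, alongside or instead of literal prefixes such as X-.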

-K types

By default, textmail deletes several types of non-text attachment. The -O and -! options delete even more. This option specifies, by mimetype and/or filename extension, the set of attachments not to delete. This overrides all deletions.

The types argument is a comma-separated list of mimetypes and/or filename extensions and/or the names of files containing mimetypes and/or filename extensions (blank lines, leading or trailing whitespace, and shell-style comments are ignored). Note that the elements are interpreted as a complete mimetype, if they contain a slash character, or as either the * in application/* or as a filename extension if they do not contain a slash character. For example, textmail -Wf!Kdoc,docx deletes all application/* attachments except MS Word documents.
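
The way each element of the types argument is interpreted can be sketched in shell as follows. It is an illustration of the rule stated above, not textmail's code, and the function name describe_element is hypothetical.

```shell
# Sketch: report how a single element of a types list would be read.
describe_element() {
    case "$1" in
        # Contains a slash: treated as a complete mimetype.
        */*) printf 'mimetype %s\n' "$1" ;;
        # No slash: either the subtype of application/* or a filename extension.
        *)   printf 'mimetype application/%s or extension .%s\n' "$1" "$1" ;;
    esac
}
```

For example, describe_element pdf reports both application/pdf and the .pdf extension, while describe_element text/calendar reports only the mimetype.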

-F types

This option specifies, by mimetype and/or filename extension, the set of attachments to save to files on disk. This happens before any translations or deletions in the email message itself. By default, attachments are saved to the current directory. The -G option can be used to specify an alternative directory to save attachments to.

The types argument is a comma-separated list of mimetypes and/or filename extensions and/or the names of files containing mimetypes and/or filename extensions (blank lines, leading or trailing whitespace, and shell-style comments are ignored). Note that the elements are interpreted as a complete mimetype, if they contain a slash character, or as either the * in application/* or as a filename extension if they do not contain a slash character. For example, -F doc,docx saves MS Word documents to the current directory.

-G path

This option specifies the directory to save attachments to when used with the -F option. Without this option, attachments are saved to the current directory.

-C spec

This option specifies custom translations for attachments with particular mimetypes or filename extensions.

The spec argument is a comma-separated list of translation specifiers and/or the names of files containing translation specifiers (blank lines, leading or trailing whitespace, and shell-style comments are ignored).

Each translation specifier contains two or three colon-separated items: the mimetype or filename extension of the attachments to translate; the (optional) filename extension to use for the resulting translated attachment; and the simple shell command to translate the attachment. The shell command must read the file specified on the command line, and write the resulting translated attachment to its standard output. For example:

    -C text/calendar:txt:vcalendar-filter
    -C text/calendar:vcalendar-filter

The mimetypes, filename extensions and shell commands must not contain any comma or colon characters or nul bytes. If the optional filename extension to use for the resulting translated attachment is not supplied, it is assumed to be "txt".
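
A translation command only needs to read the file named on its command line and write text to standard output. The following shell sketch shows the shape of such a command for text/calendar attachments. The name ics_to_text and the choice of fields are hypothetical. It is not one of the filters mentioned elsewhere in this manpage.

```shell
# Sketch of a custom translation command (hypothetical, for illustration).
# ics_to_text FILE: print the summary, start, end and location lines
# of an iCalendar file, with carriage returns removed.
ics_to_text() {
    grep -E '^(SUMMARY|DTSTART|DTEND|LOCATION)[;:]' "$1" | tr -d '\r'
}
```

To use something like this with -C, the body would need to be saved as an executable script in the $PATH, under a name that contains no comma or colon characters.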

-Y

By default, unless the -H option is given, textmail replaces text-versus-HTML alternative attachments with the HTML alternative translated to plain text.

Earlier versions of textmail would replace them with just the plain text alternative. Unfortunately, the plain text alternative is sometimes a vestigial attachment that isn't a real text alternative of the HTML content. It is often just a message indicating that you should be reading the HTML alternative instead. So textmail now translates the HTML alternative by default.

This option causes textmail to mostly revert to its original behaviour and replace text-versus-HTML alternative attachments with just the plain text alternative. However, if textmail can identify the plain text alternative as a vestigial attachment that isn't a real plain text alternative, it will translate the HTML alternative instead, so as not to discard the content of the attachment.

textmail will identify a plain text alternative as vestigial if it is empty or extremely short, or if it contains any of the following pieces of text:

    Your email client does not support HTML email
    Your email client cannot read this email
    This email must be viewed in HTML mode
    Please enable HTML

This option is not recommended. Translating the HTML alternative is probably always a better choice. Any URLs referred to by the HTML alternative will be preserved in the translation to plain text. This is usually not the case in the original plain text alternative. This option is only provided in accordance with the principle that any feature that can't be turned off is a bug.

If the -H option is supplied, this option does nothing.

If the above list of patterns is insufficient, the -Q option can be used as well to include additional patterns to identify vestigial text alternatives.

-Q spec

This option specifies custom patterns for identifying vestigial text alternatives. It is only useful when the -Y option is used to choose plain text alternatives over translated HTML alternatives. If the -Y option is not supplied, this option does nothing.

The spec argument is a comma-separated list of patterns (i.e. short pieces of text that would appear in a vestigial text alternative), and/or the names of files containing patterns (blank lines, leading or trailing whitespace, and shell-style comments are ignored).

-f

Whenever textmail is unable to translate any attachment into text, it will leave the attachment intact. This happens when the requisite translation software can't be found, when it runs but returns an error code, and when it produces no output. It also happens when winmail.dat attachments are corrupt. This option causes the empty translation to take the place of the original attachment. Only the name of the attachment is preserved. This is needed to ensure plain text even in the face of an MS Word document that contains no text (e.g. only images).

-?

Print the paths of all helper applications then exit.

EXAMPLES

A procmail(1) recipe that insists on pure text and no X- headers (with output in mailbox format):

  :0 fw
  | textmail -Mf!DX-

Do the same but to an existing mailbox file:

  textmail -Mf!DX- < mailbox > mailbox-as-text

Delete all application/* attachments except for PostScript and PDF (and don't translate PDF into text):

  textmail -!PKps,pdf

Delete all application/* attachments except for zip files and gzipped tar files:

  textmail -!Ktar.gz,zip

A procmail(1) recipe that just unpacks winmail.dat attachments but doesn't translate the attachments contained therein into text and doesn't delete MS Windows executables (with output in mailbox format):

  :0 fw
  | textmail -MWEHRPLIAVXBS

Save MS Word, MS Excel, and PDF attachments in /tmp without changing the message:

  textmail -WEHRPULIAVXBS -F doc,docx,xls,xlsx,pdf -G /tmp

Save MS Word, MS Excel, and PDF attachments in /tmp without changing the message (other than translating winmail.dat attachments into standard attachments):

  textmail -WEHRPLIAVXBS -F doc,docx,xls,xlsx,pdf -G /tmp

Replace text/calendar and text/vcard attachments with plain text:

  textmail -WEHRPULIAVXBS -C text/calendar:vcalendar-filter,text/vcard:mutt.vcard.filter

REQUIREMENTS

Modern MS Word attachments (.docx) are translated into plain text using docx2txt(1). If textmail can't find docx2txt(1), then modern MS Word attachments are left intact. So make sure that docx2txt(1) is installed and in the $PATH.

Traditional MS Word and RTF attachments (.doc and .rtf) are translated into plain text using antiword(1) or catdoc(1). If textmail can't find antiword(1) or catdoc(1), then traditional MS Word and RTF attachments are left intact. So make sure that antiword(1) and/or catdoc(1) is installed and in the $PATH.

Modern MS Excel attachments (.xlsx) are translated into csv files using xlsx2csv(1). If textmail can't find xlsx2csv(1), then modern MS Excel attachments are left intact. So make sure that xlsx2csv(1) is installed and in the $PATH.

Traditional MS Excel attachments (.xls) are translated into csv files using xls2csv(1). If textmail can't find xls2csv(1), then traditional MS Excel attachments are left intact. So make sure that xls2csv(1) is installed and in the $PATH.

HTML attachments are translated into plain text using lynx(1). If textmail can't find lynx(1), then HTML attachments are left intact. So make sure that lynx(1) is installed and in the $PATH.

PDF attachments are translated into plain text using pdftotext(1). If textmail can't find pdftotext(1), then PDF attachments are left intact. So make sure that pdftotext(1) is installed and in the $PATH.

textmail also requires perl(1) and pod2man(1) and pod2html(1) (which come with perl(1)) and mktemp(1).
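
The -? option prints the paths of the helper applications that textmail finds. As a complementary check, the following shell sketch lists any named commands that are missing from the $PATH. The function name missing_helpers is hypothetical.

```shell
# Sketch: print each named command that cannot be found in $PATH.
missing_helpers() {
    for helper in "$@"; do
        command -v "$helper" >/dev/null 2>&1 || printf '%s\n' "$helper"
    done
}
```

It could be called with the helper names from this section, for example missing_helpers docx2txt antiword catdoc xlsx2csv xls2csv lynx pdftotext mktemp. No output means that every helper was found.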

If textmail fails to create a temporary directory, or if it is instructed to do nothing (i.e. textmail -WEHRPULIAVXBS), then it degenerates into cat(1).

CAVEAT

The latest version of catdoc's xls2csv(1) at the time of writing (i.e. catdoc-0.93.3) loses data. There are alternatives.

If textmail is unable to create a temporary directory (in /tmp), then it degenerates into cat(1). Without a temporary directory, no attachments will be translated or deleted no matter what options (even -f) were given to textmail. So make sure that /tmp is writable. Also make sure that mktemp(1) is available otherwise an insecure temporary directory will be created.
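
Before relying on textmail in a mail filter, it may be worth confirming both conditions. The following shell sketch is one way to do that. The function name check_tmp and the scratch directory name are hypothetical, and the sketch does not reproduce how textmail itself creates its temporary directory.

```shell
# Sketch: verify that DIR (default /tmp) is writable and mktemp(1) works.
check_tmp() {
    dir=${1:-/tmp}
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "$dir is not a writable directory" >&2
        return 1
    fi
    if ! command -v mktemp >/dev/null 2>&1; then
        echo "mktemp not found in PATH" >&2
        return 1
    fi
    # Create and remove a scratch directory to prove that creation works.
    scratch=$(mktemp -d "$dir/textmail-check.XXXXXX") || return 1
    rmdir "$scratch"
}
```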

SEE ALSO

procmail(1), docx2txt(1), antiword(1), catdoc(1), xlsx2csv(1), xls2csv(1), lynx(1), pdftotext(1), pod2man(1), pod2html(1), vcalendar-filter, mutt.vcard.filter, http://raf.org/minimail/

AUTHOR

20200620 raf <raf@raf.org>

URL

http://raf.org/textmail/