Fundamentals of Internet EMail    [v0.0x]
==============================

Copyright (c) 2001 Phil Pennock <email-fundamentals@globnix.org>
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.1 or
any later version published by the Free Software Foundation, with no
Invariant Sections, no Front-Cover Texts and with no Back-Cover Texts.
A copy of the license is available at
 <URL:http://www.fsf.org/copyleft/fdl.html>.

The most recent version of this document can be found at:
 <http://www.globnix.org/mail-intro.txt>

Contents:

 * Prerequisites
 * Infrastructure
 * Message Structure and Composition
 * Local Delivery & Access
 * Handling Problems


Prerequisites
-------------

You should be able to send and receive email, be able to use your mail
client to view full message headers, and you should have a brain willing
to absorb information.  Oh, and you should understand terms such as
"operating system" & "email address".  If you don't, this isn't for you.
Stop reading now.

Experienced mail administrators should already know all this.  Comments
and feedback welcome, to <email-fundamentals@globnix.org>.  Warning: I
can be short-tempered with people who claim to know what they're doing,
but don't.

For the others amongst you: hopefully, this quick guide will bring you
up to speed on the fundamentals of email and associated terminology, as
currently found on the public Internet.

For a free starter: RFC = Request For Comments.  These documents cover
many topics.  Many are standards or proposed standards for public usage,
in various stages of development.  Others are helpful introductions,
progress reports, experimental protocols, the list goes on.  Readability
varies considerably.  If in doubt, look!  A web interface for retrieving
RFCs is at <http://www.ietf.org/rfc.html>

Second starter: Email addresses are, for the purposes of this document,
divided into two parts, a local part on the left-hand side and an email
domain on the right-hand side.  So <fred@example.org> has the local part
"fred" and the email domain "example.org".

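As a rough illustration of that split, here is a minimal Python sketch.
The function name split_address is made up for this document, and the
sketch ignores the finer points of address syntax (comments, quoting
rules and so on); it only shows the local part / email domain division.

```python
def split_address(address):
    """Split an email address into (local part, email domain)."""
    # Split on the LAST "@": a quoted local part may itself contain "@".
    local_part, sep, domain = address.rpartition("@")
    if not sep or not local_part or not domain:
        raise ValueError("not a local-part@domain address: %r" % address)
    return local_part, domain


if __name__ == "__main__":
    print(split_address("fred@example.org"))
```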

Infrastructure
--------------

Most people only ever deal with their mail client.  Many seem to think
that this client _is_ email access or somesuch nonsense.  No.  The mail
client is the nice (!) glossy (!!) front-end, so that the end-users
don't have to deal with what's underneath.  People who do have to
actually run the infrastructure have their own terminology for the
various types of system, so as to be absolutely precise in their
references.  It's also clear, once you get the hang of it.  Honest.

The mail clients are referred to as "Mail User Agents", or MUAs.  Since
the word "client" has another common meaning in client-server
terminology, which will be used below, from now on the term MUA will be
used to refer to whatever piece of software is used to control the
reading and sending of email.  Examples include Eudora, Mutt, allegedly
Outlook, etc.

The main protocol used for passing on mail as it arrives is Simple Mail
Transfer Protocol (SMTP) and its derivatives.  It's described in RFC2821.
RFC2821 replaces RFC821, incorporating many changes from actual
practice, such as ESMTP, Extended SMTP, previously described in RFC1869.
This document will use the term SMTP for the newer specification.

Some MUAs speak just enough SMTP to be able to send out email.  Others
utilise a local command to submit mail.  But for the most part SMTP is
spoken by Mail Transport Agents, or MTAs.  It's MTAs that route mail,
translate to/from non-Internet protocols, generate error messages and
bounces and so on.

When determining _how_ to route mail closer to its destination, an MTA
typically has special rules for local email domains, and otherwise looks
up a special type of DNS (Domain Name System) record, the MX record,
for the email domain so as to get a hostname to connect to, to pass the
mail on.  MX records have a precedence value associated with them.  The
lowest value will be used if reachable, with higher values being tried
if the lowest is unreachable.
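
The precedence rule can be sketched in a few lines of Python.  This is
an illustration only: it does no DNS lookups, the function name
mx_try_order is invented, and the records are hypothetical (precedence,
hostname) pairs as if they had already been fetched.

```python
def mx_try_order(mx_records):
    """Return hostnames in the order an MTA would try them.

    mx_records is a list of (precedence, hostname) pairs, assumed to
    have been looked up already.  Lowest precedence value comes first.
    """
    # Sorting the pairs orders them by precedence value.  Ties fall back
    # to hostname order here, which is just an artefact of this sketch.
    return [host for _precedence, host in sorted(mx_records)]


if __name__ == "__main__":
    # Hypothetical MX records for the email domain example.org.
    records = [(20, "backup.example.org"), (10, "mail.example.org")]
    print(mx_try_order(records))
```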

Some MTAs don't use DNS for each email.  Instead, they're configured to
just send all mail on to another host which will take care of it.
Typically, an MTA running on the system of an ISP's customer will pass
all mail on to such a host.  Some firewall arrangements also mandate
this.  The MTA which accepts all such mail for a given set of machines
is known as a Smarthost.  It's the MTA burdened with all the hard logic.

In all cases, there is a clear hand-off of responsibility for delivering
each message.  Once an MTA accepts a message and acknowledges success,
it is then responsible.  The MTA which handed over the message will
consider the message successfully handled.  So there's a single chain of
MTAs used.  SMTP never tries to deliver one message via multiple
routes; multiple MTAs may be tried, but the first one which accepts the
message gets it.
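
The hand-off logic might be sketched like this in Python.  Nothing here
talks to the network: try_deliver is a hypothetical callable standing in
for a whole SMTP transaction, and the name hand_off is invented for the
example.

```python
def hand_off(message, hosts, try_deliver):
    """Offer message to each host in turn; stop at the first acceptance.

    try_deliver(host, message) is a hypothetical callable which returns
    True when the remote MTA accepts the message and acknowledges
    success.  Returns the accepting host, or None if nobody accepted.
    """
    for host in hosts:
        if try_deliver(host, message):
            # Responsibility has passed on; no other host is tried.
            return host
    # Nobody accepted, so we are still responsible for the message.
    return None
```

The early return is the point: once one MTA has acknowledged success,
the message is never offered to the remaining hosts.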

So, the email message wends its way across the public Internet and
hopefully arrives at a system which regards the email domain as local.
At this point, it's difficult to generalise what happens as there are so
many options.  See the section "Local Delivery & Access" below.


Message Structure and Composition
---------------------------------

An email message being transferred via SMTP can in the first instance be
divided into two parts: the SMTP Envelope and the SMTP Data.  Message
routing is normally performed based upon the Envelope.  Most systems
have no need to look inside the Data in order to find out how to route
the message.  In fact most systems should not.  The exceptions are the
system which initially submits the mail for delivery and perhaps a
system performing local delivery.  Typically, the end-user doesn't
directly see the SMTP Envelope information.  The separation of Envelope
and Data is part of the SMTP protocol itself.  All lines in both
Envelope and Data are (discounting special rarely-used cases) terminated
with CRLF (Carriage-Return, Line-Feed).  We'll return to the SMTP
Envelope at the end of this section.

In the second instance, the SMTP message's Data section can also be
divided into two parts: the message Headers and the message Body.  This
is as per RFC2822.  The two sections are separated by a blank line, i.e.
the sequence CRLFCRLF.  Headers come first, and are key-value pairs.

The keys are case-insensitive, must start immediately after a CRLF and
each key is separated from the associated value by a colon, ":", with
optional whitespace on either side.  If a line starts with whitespace,
then it is continuation data which constitutes part of the value of the
most recent line to start with a key.  Although the keys are not
case-sensitive, they are by convention written to start with a capital
letter and to have a capital after any hyphens, but otherwise be
lower-case.  The case sensitivity and semantics of each header value
depend upon the key.  The only guarantee is that a CRLF followed by
whitespace is equivalent to a single space character.
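
To make the Headers/Body split and the continuation rule concrete, here
is a small Python sketch.  It is deliberately naive (real messages are
better handled by a proper library such as Python's "email" package),
and the names parse_data and get_header are invented for this example.

```python
def parse_data(data):
    """Split SMTP Data into (list of (key, value) headers, body)."""
    # Headers and Body are separated by the first blank line.
    head, _sep, body = data.partition("\r\n\r\n")
    headers = []
    for line in head.split("\r\n"):
        if not line:
            continue
        if line[0] in " \t" and headers:
            # Continuation line: CRLF plus whitespace counts as a single
            # space within the value of the most recent header.
            key, value = headers[-1]
            headers[-1] = (key, value + " " + line.strip())
        else:
            key, _colon, value = line.partition(":")
            headers.append((key.strip(), value.strip()))
    return headers, body


def get_header(headers, key):
    """Return the first value for key, comparing keys case-insensitively."""
    for name, value in headers:
        if name.lower() == key.lower():
            return value
    return None


if __name__ == "__main__":
    data = ("Subject: a long\r\n subject line\r\n"
            "From: fred@example.org\r\n\r\nHello\r\n")
    headers, body = parse_data(data)
    print(get_header(headers, "subject"))
```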

Some common keys seen in headers are "From", "To", "Subject", "Date",
"Content-Type" and "Received".  Those last two deserve more explanation.

Above, it was stated that most MTAs don't need to look inside the Data
section in order to make decisions.  This is true.  However, MTAs _must_
make one specific change and can in certain circumstances make others.
Each and every MTA which forwards a message prepends an extra header to
the message, the "Received" header, which describes the transaction used
to handle the message.  The header is prepended (put at the start) to
make this easy and fast, instead of having to search for the end of the
headers.  If you look at an email which you have received and view all
the headers, you should see at (or perhaps nearly at) the start the
Received headers.

As a consequence of being prepended these are of course in reverse order
with the most recent such header at the top.  This first Received header
will describe the transaction during which the last MTA saw the message.
The last Received header describes the first MTA to see and handle the
message.  Whilst the format is not rigidly specified, each Received
header should indicate the canonical (standardised) name used by the
host on which the MTA is running, a timestamp to indicate when the
message was processed, a local transaction identifier and any other
information which is relevant.  A typical MTA transaction which involves
receiving the message from somewhere would therefore also include
identifying information for the MTA from which the message was received.
Herein lies the curse which makes ISP abuse-teams weep.
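
Since the Received headers read newest-first, it can help to flip them
when tracing a message's path.  A minimal sketch using Python's standard
"email" package follows; the function name received_oldest_first and the
sample hostnames are made up for illustration.

```python
from email import message_from_string


def received_oldest_first(raw_message):
    """Return the Received header values, first MTA to handle it first."""
    msg = message_from_string(raw_message)
    # get_all() returns the values in the order they appear in the
    # message, i.e. most recently added first.
    received = msg.get_all("Received", [])
    return list(reversed(received))


if __name__ == "__main__":
    raw = (
        "Received: by final.example.org; Tue, 2 Jan 2001 10:00:02 +0000\n"
        "Received: by first.example.org; Tue, 2 Jan 2001 10:00:01 +0000\n"
        "Subject: hi\n"
        "\n"
        "body\n"
    )
    for hop in received_oldest_first(raw):
        print(hop)
```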

The only part of the identifying information for the source host which
can be relied upon is the IP address.  Hope and pray that the MTA
includes this.  Of course, if you don't trust any previous hosts in the
delivery chain, even this is suspect since the Received header could be
faked.
The hostname?  Are you sure that the MTA verifies that the forward DNS
for that hostname matches the reverse DNS for the IP address?  But are
you looking at the hostname?  The first part of the SMTP protocol
involves saying hello, with either the HELO or the EHLO command.  The
name provided won't necessarily have any bearing upon reality, but it
will typically be included in the header.  Spammers fake this.  ISPs get
many complaints from people who don't realise that it's faked and think
that the ISP is allowing spam through.  A good MTA will allow the mail
administrator to require that the HELO/EHLO have some bearing upon
reality.  Even where this is available, it is unfortunately often not
enabled.  *sigh*
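
The forward/reverse DNS check mentioned above can be sketched in Python.
To stay self-contained this does no real DNS lookups: reverse_dns and
forward_dns are hypothetical dictionaries standing in for the lookups,
the function name is invented, and the addresses are from a range
reserved for documentation.

```python
def hostname_is_confirmed(ip, reverse_dns, forward_dns):
    """Check that the reverse DNS name for ip maps forward to ip again.

    reverse_dns maps IP address -> hostname; forward_dns maps hostname
    -> list of IP addresses.  Both are stand-ins for real DNS queries.
    """
    name = reverse_dns.get(ip)
    # No reverse entry at all: nothing to confirm.
    if name is None:
        return False
    # The name only counts if looking it up leads back to the same IP.
    return ip in forward_dns.get(name, ())


if __name__ == "__main__":
    reverse = {"192.0.2.25": "mail.example.org",
               "192.0.2.66": "mail.example.org"}   # second one is a liar
    forward = {"mail.example.org": ["192.0.2.25"]}
    print(hostname_is_confirmed("192.0.2.25", reverse, forward))
    print(hostname_is_confirmed("192.0.2.66", reverse, forward))
```

Note that none of this looks at the HELO/EHLO name, which is simply
whatever the connecting system chose to say.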

And so, the Content-Type header.  As the name suggests, it describes the
type of the data in the message body.  The structure of the header value
is defined in RFC2045, with the media types themselves in RFC2046.  For
handling normal mail, this is typically irrelevant for the MTA and highly
relevant for the MUA.  The default type/subtype will be text/plain.
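
For the curious, here is roughly how an MUA-like program might read the
type/subtype, sketched with Python's standard "email" package.  The
function name content_type_of is invented; the library call falls back
to text/plain when no Content-Type header is present.

```python
from email import message_from_string


def content_type_of(raw_message):
    """Return the type/subtype of a message, in lower case."""
    msg = message_from_string(raw_message)
    # Falls back to "text/plain" if there is no Content-Type header.
    return msg.get_content_type()


if __name__ == "__main__":
    print(content_type_of("Subject: hi\n\nhello\n"))
    print(content_type_of(
        "Content-Type: text/html; charset=us-ascii\n\n<p>hi</p>\n"))
```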

The two main circumstances for caring about Content-Type are for
filtering for security purposes and handling mailing-lists.
Mailing-lists typically just enforce text/plain (or *spit* text/html)
for various reasons.

Unfortunately, in today's world security holes in products are a fact of
life.  Some products have more security holes than others.  Some
products are riddled with holes.  Some environments make it difficult to
be secure.  Email is, in some environments, no different.  Whilst all
data sent is, in one sense, "instructions for a computer" (this document
consists of ASCII instructions for parsing by anything which feels like
it; this document is safe, it contains no ANSI escapes), there are
message-types which are considered more dangerous.  Raw executables,
complicated document formats with embedded macro capabilities, HTML with
scripting content, these are all potentially dangerous.

Filtering involves looking at the data.  Traditionally, the MTA hasn't
had to do this and hasn't had to natively understand MIME.  Effective
scanning within the MTA is not currently common.  Some rudimentary
checks against primitive dangers can be performed by the MTA.  Anything
comprehensive or more detailed currently involves handing the task off
to another process.


So, returning to the SMTP Envelope -- where does it come from?  The
first process which speaks SMTP constructs it.  From where?  Depends how
it's invoked.  What's in the SMTP Envelope?  Basically, if the extras
are ignored, it boils down to "Address of the Sender" and "Addresses of
the Intended Recipients", referred to below as SMTP From and SMTP
Recipients.
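
To show how little there is in the Envelope, here is a Python sketch
which builds the SMTP command lines carrying it (MAIL FROM, RCPT TO,
then DATA to introduce the message data).  It only builds strings and
sends nothing; the function name envelope_commands is invented and the
extras mentioned above are ignored.

```python
def envelope_commands(smtp_from, smtp_recipients):
    """Build the SMTP command lines which carry the Envelope."""
    commands = ["MAIL FROM:<%s>" % smtp_from]
    # One RCPT TO per intended recipient.
    commands += ["RCPT TO:<%s>" % rcpt for rcpt in smtp_recipients]
    # DATA ends the Envelope; the Headers and Body follow it.
    commands.append("DATA")
    # Every line is terminated with CRLF.
    return [command + "\r\n" for command in commands]


if __name__ == "__main__":
    for line in envelope_commands("fred@example.org",
                                  ["jim@example.org", "sheila@example.net"]):
        print(repr(line))
```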

An MUA typically does not itself speak SMTP.  There are too
many different things to handle, all needing tuning for the special
circumstances of the local site, with this needing to be set up for each
of several MUAs.  Some MUAs might, but more typically they hand the
message off to another program.  On Unix-compatible systems, it just
happens to be that the interface which has evolved for initial injection
is to actually invoke the MTA in a non-daemon mode, where it can
optionally extract information directly from the RFC2822 headers.  This
is the "sendmail" interface, since for a long time Sendmail was _the_
predominant MTA on Unix systems.  Other MTAs will typically emulate this
interface.

For a normal email message constructed with an MUA, the SMTP From
address will probably match that in the RFC2822 "From" header.
Probably.  Some MUAs allow this to be otherwise.  Many MTAs which
receive a message locally, not via SMTP, and which can determine who
sent the message, will enforce a requirement that the email domain be
the canonical one for that MTA and that the local part correspond to
that of the user account used to send the email, unless the user is in
some way "trusted" to do otherwise.  If the user isn't so trusted,
typically an RFC2822 "Sender" header would also be inserted.

The SMTP Recipients are more complicated.  Often, there's bad karma
involved in trying to define exactly what should be done.  If the first
MTA is having to determine them (as would be the case with the
sendmail-compatible interface, described above), instead of being
directly told them by the MUA, then it would typically extract the
addresses from the RFC2822 "To", "Cc" and "Bcc" headers, then remove the
"Bcc" header.  All of this initial submission stuff is very Operating
System and software specific.
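
A sketch of that extraction, using Python's standard "email" package.
This is one plausible reading of the behaviour described above, not a
description of what any particular MTA does; the function name
extract_recipients is invented.

```python
from email import message_from_string
from email.utils import getaddresses


def extract_recipients(raw_message):
    """Return (recipient addresses, message text with Bcc removed)."""
    msg = message_from_string(raw_message)
    fields = []
    for key in ("To", "Cc", "Bcc"):
        fields.extend(msg.get_all(key, []))
    # getaddresses() yields (display name, address) pairs.
    recipients = [addr for _name, addr in getaddresses(fields) if addr]
    # Remove every Bcc header; harmless if there were none.
    del msg["Bcc"]
    return recipients, msg.as_string()


if __name__ == "__main__":
    raw = (
        "From: fred@example.org\n"
        "To: Jim <jim@example.org>, sheila@example.org\n"
        "Bcc: secret@example.org\n"
        "Subject: hi\n"
        "\n"
        "body\n"
    )
    recipients, stripped = extract_recipients(raw)
    print(recipients)
    print(stripped)
```

The Bcc address ends up among the SMTP Recipients but no longer appears
anywhere in the Data, which is the whole idea of a blind copy.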

For an example of the confusion, the sendmail-compatible interface can
be used in two different ways for initial translation into an SMTP
message.  One way is to specify the message recipients on the
command-line, which just forwards the mail on and _supposedly_ does NOT
strip Bcc.  The other is to pass sendmail the "-t" option, which does
the extraction defined in the previous paragraph.  But then there's
confusion over what happens with addresses specified on the command-line
too.  To quote the Exim documentation (since Exim tries to remain
Sendmail compatible for message submission):

  According to Sendmail documentation, if any addresses are present on
  the command line when the -t option is used to build an envelope from
  a message's headers, they are removed from the recipients list. This
  is also how Smail behaves. However, it has been reported that some
  versions of Sendmail in fact add the argument addresses to the
  recipients list.
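
The two invocation styles can be sketched as argument lists in Python.
This is an illustration only: the path /usr/sbin/sendmail is an
assumption about the local system, the function name sendmail_argv is
invented, and the sketch deliberately never mixes "-t" with
command-line addresses, given the confusion quoted above.

```python
def sendmail_argv(recipients=None, sendmail_path="/usr/sbin/sendmail"):
    """Build an argument list for the sendmail-compatible interface.

    sendmail_path is an assumed location; adjust for the local system.
    """
    if recipients:
        # Recipients named explicitly on the command-line.
        return [sendmail_path] + list(recipients)
    # No recipients given: ask for them to be extracted from the headers.
    return [sendmail_path, "-t"]


if __name__ == "__main__":
    print(sendmail_argv())
    print(sendmail_argv(["jim@example.org"]))
```

The resulting list could be handed to something like subprocess.run(),
with the message supplied on standard input.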

Various other transformations might be done upon a new message.  Really,
this is "message submission" and arguably not the job of an MTA, but a
local submission agent.  But that sounds kinky.


Local Delivery & Access
-----------------------

[
 Topics needing handling:
  MTA vs/with MDA
  spools, spool formats, local users, storing envelope information
  IMAP, POP3, HTTP
  autoresponders (& SMTP From)
  mailing-lists (or in a Misc section?)
]


Handling Problems
-----------------

[
 Topics needing handling:
  reliability & responsibility hand-off
  bounces
  notification mail from <>
  DSN
  Duplication of messages (common causes)
  Spam
]


Document Miscellany
-------------------

After version 0.02, mail-intro.txt moved into RCS control.  The document
version will remain 0.something until the document is pretty much
complete.  The internal RCS revision of this document is $Revision: 1.4 $.

Phil would like to thank Demon Internet Netherlands, his employer, for
graciously allowing him to spend time on this document whilst at work
and for letting him play with weird large mail-systems.  This is a
shameless plug.  :^)  <http://www.demon.nl/eng/>

[ Text-editor magic line: vim:set wm=8: ]