UTF-7 (was Re: OT: Best programming suite recommendations)

From: Pete Turnbull <pete_at_dunnington.u-net.com>
Date: Sat Nov 2 19:55:00 2002

On Nov 2, 15:36, Robert Wittig wrote:
> > I'm not sure how one disables the annoying non-standard escape
> > sequences used by Microsoft's mail clients. ISTR someone else having
> > this same problem several months ago.

Er, yes, IIRC it was the same person :-)

> That was my text, being back-quoted by John, that had the weird escape
> codes. Is the problem something that is being generated by John's
> backquote, or am I the culprit?

No, it's John's mail client. It's not just the quotes or whatever is
being used to indent the quoted text. It's anything that isn't flat 7-bit
ASCII, with some characters reserved to allow encoding.

For example, the line

> +ACo- +ACI-Data Explorer, from the IBM +AF8-Deep Computing Institute+AF8AIg-

contains 4 different escape sequences, one of which is 8 characters long!
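
For anyone who wants to see what that line is supposed to say, here is a
minimal sketch, assuming Python 3 and its built-in 'utf-7' codec. That is
purely my choice for illustration and has nothing to do with how any
particular MUA works; the variable names are made up.

```python
import re

# The example line, minus the "> " indent, as a single line.
raw = "+ACo- +ACI-Data Explorer, from the IBM +AF8-Deep Computing Institute+AF8AIg-"

# An escape sequence runs from '+' through base64 characters to a closing '-'.
escapes = re.findall(r"\+[A-Za-z0-9+/]+-", raw)

# Python's standard 'utf-7' codec undoes the escapes.
decoded = raw.encode("ascii").decode("utf-7")

print(escapes)   # the four escape sequences
print(decoded)   # the text the sender actually typed
```

The escapes turn out to be nothing more exotic than an asterisk, double
quotes and underscores.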

UTF-7 is a coding system designed to handle Unicode (which is normally
16-bit or 20-bit) by using escape sequences to encode non-ASCII characters,
and optionally some ASCII punctuation, in a modified form of base64. It is
intended only for use on 7-bit systems that can't handle 8-bit data.
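
As a rough sketch of how one escape sequence is built, assuming Python 3:
take the characters as big-endian UTF-16, base64-encode the bytes, drop the
'=' padding, and wrap the result in '+' and '-'. The helper name
utf7_escape is hypothetical, and this is not a full encoder (a real one
leaves most ASCII alone and writes a literal '+' as '+-').

```python
import base64

def utf7_escape(chars: str) -> str:
    """Encode a run of characters as a single UTF-7 escape sequence.

    Hypothetical helper for illustration only, not a complete UTF-7 encoder.
    """
    utf16 = chars.encode("utf-16-be")               # 16-bit code units, big-endian
    b64 = base64.b64encode(utf16).decode("ascii")   # ordinary base64
    return "+" + b64.rstrip("=") + "-"              # no '=' padding inside escapes

print(utf7_escape("*"))    # +ACo-
print(utf7_escape('_"'))   # +AF8AIg-
```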

> OE 5 has several choices for indent on replies... I am using '> '. If
> John is using ': ' or '| ', (the other 2 choices), they might be getting
> read as something else by your MUA, and changing the indent might
> eliminate the problem.
> What MUA's are you guys running?

The text appears with the weird escape sequences on loads of MUAs, in fact
probably most of them. Very few MUAs support UTF-7, for several reasons:
it's an inefficient encoding, there are other encodings that are more
versatile, it's more awkward than most to program, and it's a one-to-many
transform (for any given string, there are several possible encodings) so
it produces unsearchable text. And since it came along later than the
other common encoding schemes, and doesn't do anything the others can't, I
suppose there was an element of "why bother?"
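
To illustrate the one-to-many point, here is a small sketch, again assuming
Python 3 and its built-in 'utf-7' codec. The three byte strings are
examples I made up; all of them decode to the same three characters, yet a
naive byte search for the plain text finds only the first.

```python
# Three different UTF-7 spellings of the same text, "a*b".
variants = [
    b"a*b",          # everything sent directly
    b"a+ACo-b",      # only the asterisk escaped
    b"+AGEAKgBi-",   # the whole string escaped
]

decoded = [v.decode("utf-7") for v in variants]
found = [b"a*b" in v for v in variants]

print(decoded)   # the same string three times
print(found)     # only the first variant contains the literal bytes
```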

If you look up Unicode and UTF-8, you'll find dozens of common applications
that support them (and UTF-8 is the accepted standard for a whole load of
things defined in RFCs, as well as the mail internationalization report
from the IMC) but the only application I know of that definitely handles
UTF-7 is Outlook. Quote from the IMC report: "Fortunately, very few
vendors implemented UTF-7, and its use is strongly discouraged in Internet
mail."

The solution is to turn off the UTF-7 character set, use ISO-8859-1 or
UTF-8 or something else that's commonly accepted, and then use a standard
content-transfer encoding, like quoted-printable or base64 if you have to
make it 7-bit-safe.
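
A minimal sketch of that approach, assuming Python 3 and its standard
quopri and base64 modules (the sample text is made up): encode the text as
UTF-8 first, then apply a standard content-transfer encoding so that only
7-bit bytes go on the wire. A real MUA would also have to set the matching
Content-Type and Content-Transfer-Encoding headers, which this doesn't show.

```python
import base64
import quopri

text = "na\u00efve caf\u00e9"      # sample text with two non-ASCII characters
utf8 = text.encode("utf-8")         # character set: UTF-8 (8-bit bytes)

qp = quopri.encodestring(utf8)      # content-transfer encoding: quoted-printable
b64 = base64.b64encode(utf8)        # content-transfer encoding: base64

# Both results are 7-bit-safe and decode back to the original text.
print(qp)
print(b64)
print(quopri.decodestring(qp).decode("utf-8"))
print(base64.b64decode(b64).decode("utf-8"))
```

Quoted-printable leaves the ASCII parts readable, which is usually kinder
to people whose MUA shows the raw text.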

-- 
Pete						Peter Turnbull
						Network Manager
						University of York
Received on Sat Nov 02 2002 - 19:55:00 GMT
