Charset/CCDD (was: Let's develop an open-source media archive from Hans Franke on 2004-08-13 (2004-August)

From: Hans Franke <Hans.Franke_at_siemens.com>
Date: Fri Aug 13 10:05:44 2004

On 12 Aug 2004 at 16:10, Sean 'Captain Napalm' Conner wrote:
> It was thus said that the Great Hans Franke once stated:
> > Rather than restricting the encoding of the XML file to a
> > specific charset, we need to restrict the USAGE within the
> > standard to certain characters, regardless of the encoding.

> Unless otherwise noted, XML files are assumed to be encoded in UTF-8,

Except that this is not really followed by any parser - all the ones I
found assume ISO 8859-1 or some other stuff, but produce UTF-8.

> *but* an XML parser is required to abort at the first error in the XML file.
> If a parser is reading an XML file without an explicit character set
> encoding scheme (which means it's assuming UTF-8) and it reads a character
> that is illegal (say the file was encoded in ISO-8859-3) it gives up
> (usually with an "illegal character at such-n-such position" error).

Not necessarily. If the parser accepts 8-bit input, it just takes
the characters as they are - an X'C3' is just seen as that, no matter
if it's 8859-1 or -3 or whatever. Now if it expects UTF-8, then X'C3'
is still a valid first byte of a two-byte sequence. Only if the next
byte doesn't start with B'10' (read X'80'..X'BF') will an error
occur. A valid pair is quite likely in some language encodings.
I have had whole 8-bit Russian files slipping through UTF-8 parsers
without illegal character errors.
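
A minimal sketch of the mechanism described above, in Python purely for illustration (`passes_as_utf8` is a hypothetical helper, not taken from any parser): a strict UTF-8 decoder accepts the lead byte X'C3' as long as the byte after it lies in X'80'..X'BF', whatever encoding the writer actually had in mind.

```python
def passes_as_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# X'C3' X'A4' is two characters in ISO 8859-1 (A-tilde, currency sign) ...
pair = bytes([0xC3, 0xA4])
as_latin1 = pair.decode("iso-8859-1")
# ... but it is also one well-formed two-byte UTF-8 sequence (a-umlaut).
as_utf8 = pair.decode("utf-8")

# X'C3' followed by an ASCII letter is not well-formed UTF-8.
broken = bytes([0xC3, 0x41])

print(len(as_latin1), len(as_utf8))                  # 2 1
print(passes_as_utf8(pair), passes_as_utf8(broken))  # True False
```

So a decoding error only shows up when an 8-bit byte happens to be followed by something outside the continuation range.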

> Right now, this is a real problem with XML deployment (it gets even
> weirder when XML files are transported via HTTP but I'm getting ahead of
> myself) so when I suggested (if we are using XML) that each *must*
> start with:

> <?xml version="1.0" encoding="US-ASCII"?>

> It was a way of self-defense. Perhaps it can be relaxed some and require:

> <?xml version="1.0" encoding="some XML defined character encoding scheme"?>

> and if the encoding scheme isn't defined, it's an error and further
> processing of the archive should stop.
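
The rule proposed in the quoted text could be sketched roughly as follows. This is only an illustration under loud assumptions: `declared_encoding` is a hypothetical helper, "defined" is approximated by "known to Python's codec registry", and the check only works on input that is ASCII-compatible in the first place.

```python
import codecs
import re

# Matches an XML declaration that carries an encoding pseudo-attribute.
# Assumption: the first bytes of the file are ASCII-compatible.
DECL_RE = re.compile(
    rb"""^<\?xml\s+version\s*=\s*["'][^"']*["']\s+encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def declared_encoding(data: bytes) -> str:
    """Return the declared encoding label, or raise ValueError."""
    match = DECL_RE.match(data)
    if match is None:
        raise ValueError("no explicit encoding declaration")
    label = match.group(1).decode("ascii")
    try:
        codecs.lookup(label)  # stand-in for "is this scheme defined?"
    except LookupError:
        raise ValueError("unknown encoding: " + label)
    return label

print(declared_encoding(b'<?xml version="1.0" encoding="US-ASCII"?><A/>'))
```

A file without the declaration, or with a label the registry does not know, raises `ValueError`, which is where processing of the archive would stop.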

Basically I would fully support your motion here ... if we lived in
a binary world. A world where, for one thing, any parser can parse
any character set, and where all files are always transported in
binary mode. But we don't. If I generate a file on an EBCDIC system
and transfer it to a Unix system, the file transfer will convert
the character set from the EBCDIC variant I use to the ASCII
variant on the target system - if I included an 'encoding="EBCDIC"'
in the <?xml line, it would suddenly be 100% wrong. Now what?
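
The transfer problem can be simulated in a few lines. This is a sketch under assumptions: Python's `cp037` codec stands in for "the EBCDIC variant I use", and the recoding step imitates what a text-mode file transfer does.

```python
# Assumption: code page 037 stands in for the EBCDIC variant on the host.
document = '<?xml version="1.0" encoding="cp037"?><DISK/>'

# Bytes as stored on the EBCDIC system.
on_host = document.encode("cp037")
# A text-mode transfer recodes every character for the target system.
on_unix = on_host.decode("cp037").encode("ascii")

# The text survived the transfer, including the now-false encoding label.
label_says = 'encoding="cp037"' in on_unix.decode("ascii")
# Trusting the label turns the ASCII bytes into garbage.
trusted = on_unix.decode("cp037")

print(label_says)                   # True
print(trusted.startswith("<?xml"))  # False
```

The file is perfectly readable on the Unix side, yet a parser that believes the declaration can no longer read it.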

Including the encoding _within_ the file as part of the data
is in my opinion the only real flaw in the basic XML definition
(if we ignore all that rubbish of namespaces etc.). The charset
of a text file is always meta information, which should never
(only) be part of the application-specific data within the file.

The whole XML part works only by assuming that the encoding
is at least done with something that is compatible with the system's
default code, otherwise the parser couldn't even find the
<?xml line.
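
To illustrate the bootstrap problem, here is a rough sketch of guessing the encoding family from the first bytes before the declaration can be read at all, loosely along the lines of the non-normative autodetection appendix of the XML 1.0 specification. `sniff_encoding_family` and its return labels are hypothetical names, and only a few signatures are covered.

```python
def sniff_encoding_family(head: bytes) -> str:
    """Guess an encoding family from the first bytes of a file."""
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8 (BOM)"
    if head.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16 (BOM)"
    if head.startswith(b"\x3c\x3f\x78\x6d"):  # "<?xm" in ASCII
        return "ascii-compatible"
    if head.startswith(b"\x4c\x6f\xa7\x94"):  # "<?xm" in EBCDIC
        return "ebcdic-family"
    return "unknown"

decl = '<?xml version="1.0"?>'
print(sniff_encoding_family(decl.encode("ascii")))  # ascii-compatible
print(sniff_encoding_family(decl.encode("cp037")))  # ebcdic-family
```

Only after this guess can the parser read the declaration, and the declaration can then refine the guess but not contradict the family.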

> > I suggest restricting the characters used in tags, attribute
> > names and attributes to 'A-Z' (uppercase), '0-9' and '-'.

> Unfortunately, XML is defined with lowercase (or it may be case
> sensitive---I do know that all XML I've seen is with lowercase tags, and
> it's pretty much a standard).

No, XML is case sensitive, not lower case or whatever. In fact,
you can use all the characters you may find to define tags or attributes.
One could, for example, use German umlauts within tags, or even write
all tags in Kanji or Arabic. So for our definition we can pretty
much stay with all-uppercase tags, attribute names and attributes.
I strongly recommend this to allow the use of our format on
systems/charsets that offer only uppercase Latin letters. This
is not only true for old Russian or Japanese 8-bit systems,
even the good old Apple ][+ had only uppercase!
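
A check for the proposed restriction might look roughly like this. It is a sketch only: the element and attribute names (`DISK`, `SIDES`, `TRACK`, `NUM`) are made up for the example, and since XML names cannot start with a digit or '-', the pattern requires a leading letter.

```python
import re
import xml.etree.ElementTree as ET

NAME_RE = re.compile(r"[A-Z][A-Z0-9-]*")   # tags and attribute names
VALUE_RE = re.compile(r"[A-Z0-9-]*")       # attribute values

def restricted_charset_violations(xml_text: str) -> list:
    """Return every tag, attribute name or value outside A-Z, 0-9, '-'."""
    bad = []
    for elem in ET.fromstring(xml_text).iter():
        if not NAME_RE.fullmatch(elem.tag):
            bad.append(elem.tag)
        for name, value in elem.attrib.items():
            if not NAME_RE.fullmatch(name):
                bad.append(name)
            if not VALUE_RE.fullmatch(value):
                bad.append(value)
    return bad

print(restricted_charset_violations('<DISK SIDES="2"><TRACK NUM="0"/></DISK>'))  # []
print(restricted_charset_violations('<Disk sides="2"/>'))  # ['Disk', 'sides']
```

Case sensitivity itself is easy to confirm: `ET.fromstring('<DISK></disk>')` raises `ET.ParseError` because the end tag does not match the start tag.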

Besides that, uppercase is, to me, way more readable for keywords.
Deciding on lowercase was one of the many bad ideas in XHTML.

Oh, and at least that part the XML people did get right, since
it is defined that the leading ?XML may be upper or lower case.

Regards
H.

--
VCF Europa 6.0 on 30 April and 1 May 2005 in Munich
http://www.vcfe.org/

Received on Fri Aug 13 2004 - 10:05:44 BST
