Charset/CCDD (was: Let's develop an open-source media archive standard)

From: Hans Franke <>
Date: Thu Aug 12 09:06:02 2004

On 11 Aug 2004 at 17:41, Vintage Computer Festival wrote:
> On Wed, 11 Aug 2004, Sean 'Captain Napalm' Conner wrote:
> > It was thus said that the Great Vintage Computer Festival once stated:
> > > > XML is a more "current" technology but I was trying to keep with the
> > > > platform neutrality by sticking to text-only and not assuming the use of any
> > > > other technology like XML.
> > > XML is platform neutral because it's basically ASCII, right?

> > Nope. XML files can be represented in multiple character sets, possibly
> > including (but certainly not limited to):
> <snip!>
> > Best decide this now.

> Ok, I choose US-ASCII. This will be up for debate I'm sure, but surely
> US-ASCII is the most widely deployed character set in the world currently?

Well, yes and no. Sure, a lot of the most common codes are
to some extent US-ASCII compatible, but still different, so even
the use of strict 7-bit ASCII doesn't save us from complications.
But besides that, it's not the point. Especially when allowing
the inclusion of binary data, and/or when switching between
systems, ASCII can become incompatible.

Furthermore, such a definition (US-ASCII only) would make it
quite hard to use the format in a mainframe environment, where
EBCDIC is still the character set of choice.
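
To make the mainframe point concrete, here is a small Python sketch
comparing the byte values of a few characters in US-ASCII and in
EBCDIC. It assumes cp037, which is only one of many EBCDIC code pages
and is picked here purely as an illustration; the sample string and
the helper name byte_values are made up for this example.

```python
# Illustration only: cp037 is one EBCDIC code page among many.
SAMPLE = "A-Z09"


def byte_values(text, codec):
    """Return the byte values of text in the given codec as hex strings."""
    return [f"{b:02X}" for b in text.encode(codec)]


ascii_bytes = byte_values(SAMPLE, "ascii")
ebcdic_bytes = byte_values(SAMPLE, "cp037")

# Print the two encodings side by side, one character per line.
for ch, a, e in zip(SAMPLE, ascii_bytes, ebcdic_bytes):
    print(f"{ch!r}: ASCII 0x{a}  EBCDIC (cp037) 0x{e}")
```

None of the byte values match, so a file declared as "US-ASCII only"
has to be translated before a mainframe tool can read it as text.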

Even on a PET we already have differences in the 7-bit encoding,
not to mention the even bigger problems of handling these files
on really old systems.

Rather than restricting the encoding of the XML file to a
specific charset, we need to restrict the USAGE within the
standard to certain characters, regardless of the encoding.

I suggest restricting the characters used in tags, attribute
names and attribute values to 'A-Z' (uppercase), '0-9' and '-'.
These are found in all character sets I've seen to
date, and are thus able to tunnel whatever is needed. For various
data encoding schemes additional characters may be needed.
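
A checker for this restriction could look roughly like the Python
sketch below. The function names and sample names are hypothetical.
One assumption goes beyond the proposal: names are required to start
with a letter, because XML itself does not allow a name to begin with
a digit or a hyphen.

```python
import re

# Assumption: names must start with A-Z, because XML itself does not
# allow a name to begin with a digit or a hyphen.
NAME_RE = re.compile(r"[A-Z][A-Z0-9-]*")

# Values may use any mix of the allowed characters (or be empty).
VALUE_RE = re.compile(r"[A-Z0-9-]*")


def is_portable_name(name):
    """True if a tag or attribute name uses only 'A-Z', '0-9' and '-'."""
    return NAME_RE.fullmatch(name) is not None


def is_portable_value(value):
    """True if an attribute value uses only 'A-Z', '0-9' and '-'."""
    return VALUE_RE.fullmatch(value) is not None


# Hypothetical names, just to exercise the checks.
for candidate in ["DISK-IMAGE", "disk-image", "DISK_IMAGE", "9TRACK"]:
    print(candidate, is_portable_name(candidate))
```

A value such as "5-25" passes, while "5.25" does not, which is where
the "additional characters" for data encoding would come in.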

By restricting our definition to this, we become independent
of the charset (as long as it includes our characters) and
thus transparent to all code conversions that may happen on
the way between machines.
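
As a rough illustration of that transparency, the Python sketch below
writes text in one EBCDIC code page and reads it back as another, the
kind of mislabelled conversion that happens between machines. cp037
and cp500 are simply two EBCDIC code pages that ship with Python and
stand in for whatever a file might actually meet; the helper name
misread and the sample text are made up.

```python
# The characters proposed for tags, attribute names and values.
RESTRICTED = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-"


def misread(text, written_as, read_as):
    """Encode text in one code page and decode it as another,
    with no translation step in between."""
    return text.encode(written_as).decode(read_as)


# The restricted set sits at the same byte positions in both
# code pages, so it survives the mix-up unchanged.
restricted_ok = misread(RESTRICTED, "cp037", "cp500") == RESTRICTED

# A square bracket is placed differently in the two code pages,
# so it comes back as some other character.
bracket_ok = misread("[", "cp037", "cp500") == "["

print("restricted set survives:", restricted_ok)
print("'[' survives:", bracket_ok)
```

This only covers two code pages, so it shows the idea and does not
prove the claim for every charset.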

VCF Europa 6.0 on 30 April and 1 May 2005 in Munich
Received on Thu Aug 12 2004 - 09:06:02 BST
