Let's develop an open-source media archive standard

From: Hans Franke <Hans.Franke_at_mch20.sbs.de>
Date: Wed Aug 11 08:17:46 2004

On 11 Aug 2004 12:08, Jules Richardson wrote:

> On Wed, 2004-08-11 at 10:50, Steve Thatcher wrote:
> > Hi all, after reading all this morning's posts, I thought I would throw out some thoughts.
> > XML as a readable format is a great idea.

> I haven't done any serious playing with XML in the last couple of years,
> but back when I did, my experience was that XML is not a good format for
> mixing human-readable and binary data within the XML structure itself.

Only if you intend to keep it 100% human readable.

> To make matters worse, the XML spec (at least at the time) did not
> define whether it was possible to pass several XML documents down the
> same data stream (or, as we'd likely need for this, XML documents mixed
> with raw binary). Typically, parsers of the day expected to take control
> of the data stream and expected it to contain one XML document only -
> often closing the stream themselves afterwards.

Now, that's a feature of the reading application. XML does not
state what happens next, since that is outside its scope. It is
perfectly OK to look for the next document, or the next start
tag of the same document type, or whatever.
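As a minimal sketch of that approach (in Python, with a hypothetical `<image>` root tag - the tag name is invented for illustration), a reader can simply scan the stream for the next start tag of the document type it expects:

```python
import xml.etree.ElementTree as ET

def iter_documents(stream_text, root_tag="image"):
    """Yield successive <root_tag> documents found in one text stream.

    A deliberately simple scanner: it looks for the next start tag of
    the expected document type.  It does not handle the root tag nested
    inside itself, or the tag name appearing inside comments.
    """
    open_t, close_t = "<" + root_tag, "</" + root_tag + ">"
    pos = 0
    while True:
        start = stream_text.find(open_t, pos)
        if start < 0:
            return
        end = stream_text.find(close_t, start)
        if end < 0:
            return
        end += len(close_t)
        yield ET.fromstring(stream_text[start:end])
        pos = end
```

Off-the-shelf parsers that insist on owning the whole stream can't do this, but the scanning itself is trivial.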

> I did end up writing my own parser in a couple of KB of code which was a
> little more flexible in data stream handling (so XML's certainly not a
> heavyweight format, and could likely be handled on pretty much any
> machine), but it would be nice to make use of off-the-shelf parsers for
> platforms that have them where possible.

Right, but especially when we come down to classic platforms,
such building blocks are not always usable, and in general way
oversized. On a 48K Apple (or a 64K, 4 MHz CP/M machine) we don't
have the space to just port a C app that has 'only' 100K of code.
So reader/writer applications for the original environment
have to be small and specific to the type.

> As you've also said, my initial thought for a data format was to keep
> human-readable config seperate from binary data. The human-readable
> config would contain a table of lengths/offsets for the binary data
> giving the actual definition. This does have the advantage that if the
> binary data happens to be a linear sequence of blocks (sectors in the
> case of a disk image) then the raw image can easily be extracted if
> needs be (say, to allow conversion to a different format)

Well, that is only true if you define binary data as 8-bit and
all means of transport as 100% transparent. But it hasn't
worked that way in the past, and I doubt that we will be safe
from changes in the future.

As for the character size: we had in the past everything from
6 to 12 bits (OK, I can't remember 11-bit characters/words) as
'binary' characters. Of course 6-, 7- and 8-bit bytes can be easily
stored in an 8-bit byte, but what about 9 bits (Bull) or 12 (DEC)?
At that point you already have to incorporate special
transformation rules which are not necessarily transparent.
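To illustrate, packing 12-bit words (as on a DEC machine) into 8-bit bytes already needs such a transformation rule. A sketch in Python, packing two 12-bit words into three bytes:

```python
def pack12(words):
    """Pack 12-bit words into 8-bit bytes, two words per three bytes."""
    out = bytearray()
    for i in range(0, len(words) - 1, 2):
        a, b = words[i], words[i + 1]
        out += bytes([a >> 4, ((a & 0xF) << 4) | (b >> 8), b & 0xFF])
    if len(words) % 2:                      # odd tail word: two bytes
        a = words[-1]
        out += bytes([a >> 4, (a & 0xF) << 4])
    return bytes(out)

def unpack12(data, count):
    """Recover `count` 12-bit words packed by pack12()."""
    words, i = [], 0
    while len(words) < count:
        words.append((data[i] << 4) | (data[i + 1] >> 4))
        if len(words) < count:
            words.append(((data[i + 1] & 0xF) << 8) | data[i + 2])
        i += 3
    return words
```

Note that the reader already has to know the word count (or some other convention) to undo the packing - the transformation is not self-describing.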

Also for the requirement of a transparent transport: when
transferring files between different architectures we usually
have code or even format conversions. The most notable code
conversion would be, for example, ISO 8859-1 <-> EBCDIC, which
totally destroys the 'binary' part. Or take format conversions
as done on the way between Unix-style files and (Win-)DOS, LF
vs CR/LF. Whenever we leave the A-Z and 0-9 range we are
likely to encounter such problems.
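A small Python sketch of the LF vs CR/LF problem: the moment a 'text mode' transfer rewrites line endings, any data byte that happens to be 0x0A is silently corrupted:

```python
def text_mode_write(data: bytes) -> bytes:
    # Simulate the LF -> CR/LF translation a DOS/Windows text-mode
    # transfer applies to everything it believes is a line ending.
    return data.replace(b"\n", b"\r\n")

payload = bytes([0x41, 0x0A, 0x42])   # 0x0A here is binary data, not a newline
mangled = text_mode_write(payload)
# The 'transport' changed both the length and the content of the payload,
# so any offset table pointing into it is now wrong as well.
```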

Sure, one could code an app capable of reading ASCII/binary on
an EBCDIC machine and vice versa, but in my experience (having
done programming in mixed environments for 25 years) it's not
only a boring job, but also one of the most sensitive to
subtle errors.
Any kind of standard format must be truly machine-independent.
Thus (at least when using the recommended representation) it must
be transferable across every platform thinkable of.
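One way to meet that requirement is to restrict the recommended representation to characters that survive code conversions. A sketch using Base32 (one possible choice, not a decided format), whose alphabet of A-Z and 2-7, plus '=' padding, stays inside the safe range mentioned above:

```python
import base64

def to_portable(data: bytes) -> str:
    # Base32 output uses only A-Z and 2-7 (plus '=' padding), characters
    # that map cleanly through ASCII <-> EBCDIC and similar conversions.
    return base64.b32encode(data).decode("ascii")

def from_portable(text: str) -> bytes:
    return base64.b32decode(text)
```

The cost is a 8:5 size expansion, which seems a fair trade for a format whose first goal is surviving arbitrary transports.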

> > I looked at the CAPS format and in part that would be okay. I would like
> > to throw in an idea of whatever we create as a standard actually have
> > three sections to it.

> So, first section is all the 'fuzzy' data (author, date, version info,
> description etc.), second section describes the layout of the binary
> data (offsets, surfaces, etc.), and the third section is the raw binary
> data itself? If so, I'm certainly happy with that :-)

I would rather go for an annotated format, where more detailed
information can be added at any point, and not necessarily
in certain sections. Especially since the 'fuzzy' data is
usually not needed for the job itself.
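A hypothetical sketch of what such an annotated format could look like - all element and attribute names here are invented for illustration, not part of any existing standard:

```xml
<image system="Apple II" encoding="GCR">
  <comment>Annotations may appear at any point, not in fixed sections.</comment>
  <track number="0">
    <comment author="hans" date="2004-08-11">bad checksum on sector 5</comment>
    <sector number="0" size="256">JBSWY3DPEB3W64TMMQ======</sector>
  </track>
</image>
```

A tool that only archives or indexes can skip the comments entirely; a reader that cares about a particular track finds the annotation right where it applies.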

> One aside - what's the natural way of defining data on a GCR floppy? Do
> heads/sectors/tracks still make sense as an addressing mode, but it's
> just that the number of sectors per track varies according to the track
> number? Or isn't it that simple?

Well, that's already outside of what a standard definition
can define without doubt.

To my understanding, interpretation of data is always part
of a real application. As soon as it touches machine- or
format-specific implementation details, a standard may only
give guidelines on how to store them properly, but not how to
interpret them. That's part of an actual reader implementation.
And each reader will of course only understand the parts it's
made for - e.g. an Apple DOS 3.3 reader will have no idea
what a tape label for an IBM tape is, let alone be able
to differentiate between the various header types.
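To illustrate the kind of machine-specific detail a reader has to carry itself: on a Commodore 1541, for example, the GCR sector count is a function of the track number. A Python sketch of such a zone table (the table values are the well-known 1541 zones; the function name is invented):

```python
# Zone table for a Commodore 1541-style GCR disk, where the sector
# count depends on the track number rather than being constant.
ZONES = [
    (range(1, 18), 21),   # tracks 1-17: 21 sectors
    (range(18, 25), 19),  # tracks 18-24: 19 sectors
    (range(25, 31), 18),  # tracks 25-30: 18 sectors
    (range(31, 36), 17),  # tracks 31-35: 17 sectors
]

def sectors_per_track(track: int) -> int:
    for tracks, count in ZONES:
        if track in tracks:
            return count
    raise ValueError(f"track {track} out of range")
```

The archive format can store such tracks verbatim; only the 1541-specific reader needs to know this table.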

Reader/writer apps will always be as specific as they are
right now when handling a proprietary format. The big
advantage is that intermediate tools, like archiving,
indexing, etc., can be shared. Well, in fact that's the
only advantage, except that one doesn't have to
figure out a new format each time, and the simple format
does allow the ad hoc inclusion of new machines/systems.


VCF Europa 6.0 on 30 April and 01 May 2005 in Munich
Received on Wed Aug 11 2004 - 08:17:46 BST

This archive was generated by hypermail 2.3.0 : Fri Oct 10 2014 - 23:36:34 BST