Let's develop an open-source media archive standard

From: Jules Richardson <julesrichardsonuk_at_yahoo.co.uk>
Date: Wed Aug 11 11:48:41 2004

On Wed, 2004-08-11 at 15:57, Patrick Finnegan wrote:
> On Wednesday 11 August 2004 09:50, Jules Richardson wrote:
> > "could be kept" in zip files, yes - but then that's no use in 50
> > years time if someone stumbles across a compressed file and has no
> > idea how to decompress it in order to read it and see what it is :-)
>
> UNIX's compress format has been around for decades now... have its
> patents expired yet? If not, there's always gzip...

Unix goes pop in what, 2038 though? :-)

Maybe zip's not the ideal example. My point really is that if the
archives are enormous, people are going to be tempted to compress them.
If they do, what guarantee is there that a) the compression method will
still be around when someone totally unrelated wants to handle these
files in x years, and b) it will even be obvious to that person what
compression method was used on the file?

Again, it's back to longevity of the archives themselves. If something's
needed for the short term (the next ten years, say), it's not a problem.
But it'd be nice if a future generation, upon discovering one of these
archives, could know exactly what it was (and stand a good chance of
decoding it) just by looking at it (hence the human-readable part).

> You could uuencode the data... For long term archiving it might be
> advisable to have plain-text paper copies of the data (hoping in 50+
> years they'll have decent OCR technology :). So, you *need* something
> that only uses some sane subset of characters, like what uuencode or
> BASE64 encoding gives you. Uuencode is a bit older so I'd tend to
> lean towards that over BASE64.

Again, I don't like the idea of anything happening to the archive files
after creation though. I suppose the data from the raw device (floppy,
hard disk, whatever) within the archive could be encoded somehow
(leaving the config section as plain text) - provided it's in a common
enough format that we think someone will be able to find the spec for
the encoding method in x years and so be able to get at the data. That's
somewhat hard to say for sure though!

> I'm personally prejudiced against XML, but that's just me. : )

It has its uses; as a hierarchical storage mechanism for plain text it's
good IMHO. But I wouldn't use it for small or human-editable config
files, or in any situation where the data you're encapsulating is
totally swamped by the surrounding markup.

> > > I don't quite understand why representing binary as hex would
> > > affect the ability to have command line utilities.
> >
> > See my posting the other week when I was trying to convert
> > ASCII-based hex data back into binary on a Unix platform :-) There's
> > no *standard* utility to do it (which means there certainly isn't on
> > Windows). If the data within the file is raw binary, then it's just a
> > case of using dd to extract it even if there's no high-level utility
> > available to do it.
>
> You could decode it by hand (ick) or write a Q&D program to do it for
> you. I'd hope that *programming* won't be a lost art in 50+ years.

Well, I ended up doing the latter - I was actually quite surprised there
was no standard util to take a stream of hex characters, strip out
any junk, and write out the resulting binary (sort of a reverse
hexdump).
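For the record, the Q&D version is only a few lines, e.g. in Python
(assuming the input really is a bare stream of hex digits plus junk -
hexdump-style offset columns would need stripping first, since they're
hex digits themselves):

    import re, sys

    # Reverse hexdump: read hex digits from stdin, ignore everything
    # that isn't a hex digit, and write the decoded bytes to stdout.
    digits = re.sub(r'[^0-9A-Fa-f]', '', sys.stdin.read())
    if len(digits) % 2:
        sys.exit("odd number of hex digits - input looks damaged")
    sys.stdout.buffer.write(bytes.fromhex(digits))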

> Again, producing paper copies of stuff with non-printable characters
> becomes "problematic".

That's actually an extremely good point, and perhaps the best argument
so far (IMHO) for not using binary data :-) Hmmm...

> Yes. A CRC is *always* a good idea. Or, you could do an ECC even ;)

shush :-)

> I don't really understand why you're quite so concerned about archive
> size bloat,

mainly because I have hard drives swamped with scans of paper
documentation :-) Each page is 'only' 2-3MB but it doesn't half add up
quickly.

> at least over things like CRC's (which if applied liberally
> might add a 4% bloat in size)

Well, it's the usual case of analysing and weighing up features against
how often they'll actually be needed, and acting accordingly - otherwise
you end up with something like MS Word's document format, and I'm sure
nobody wants to end up there! :)

Seriously, if there's a good argument for having CRCs in more than x
(50?) percent of cases because corrupted data is expected to be a real
possibility, then make them mandatory. If not, then make them an
optional extra. I certainly can't see a good reason why they'll *never*
be needed, that's for sure.
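They'd be cheap to add, too - e.g. Python ships zlib's CRC-32, the same
polynomial zip and gzip use. A sketch (the 'crc32 = ...' key name is
just my invention for illustration, not part of any spec):

    import zlib

    def crc_line(block: bytes) -> str:
        # CRC-32 as used by zip and gzip; the polynomial is widely
        # documented, so a future reader can verify it independently.
        return "crc32 = %08X" % (zlib.crc32(block) & 0xFFFFFFFF)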

> or plain-text encoding (which would add
> between 33% to about 100% to the size). I'd rather give up some
> efficiency in this case for ensuring that the data is stored correctly,
> and can be properly read (and easily decoded) in 50+ years.

With you on the longevity side of things. Hmm, off-the-wall suggestion,
but it's only the storage format for the raw data that's an issue,
right? So does it make sense to define both binary and ASCII
representations as valid storage formats, with the format in use within
a particular archive recorded as a parameter in the human-readable
config section?

That's no different to most common image formats, say, where data might
be uncompressed, or compressed in several different ways.

Only a fraction of any program capable of producing such an archive
would be given over to the raw data encode stage, so it doesn't really
add any complexity. Utilities to convert between formats should be
pretty trivial; no more complex than decoding the archive in the first
place anyway. There's nothing to stop someone at a future date - say in
30 years - wanting to convert an archive to paper format, even if it's
currently in a format which doesn't lend itself to that - the spec
defines the pure-ASCII method, so conversion is possible at that point.
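The read side then reduces to a dispatch on that one parameter - a
rough Python sketch (the parameter values are placeholders, not a
proposal for the spec):

    import base64

    def decode_data_section(encoding, payload):
        # 'encoding' would come from the human-readable config
        # section; the names here are illustrative, not a spec.
        if encoding == "binary":
            return payload
        if encoding == "hex":
            return bytes.fromhex("".join(payload.decode("ascii").split()))
        if encoding == "base64":
            return base64.b64decode(payload)
        raise ValueError("unknown encoding: " + encoding)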

In this way those wanting compact archives - to save space, to run
against various existing utilities, etc. - can have them containing
binary data; those who think they need an ASCII representation of the
data due to tool or transmission medium limitations can use that format
- all whilst maintaining compatibility with the spec. (Potentially the
'encoding method' parameter could include other defined types -
uuencode, base64, etc. - but let's not get ahead of ourselves...)

(Funny how someone mentioned IFF files earlier; I keep on thinking of
TIFF images, where the data's structured and the format is both
versioned and maintained under strict control.)

cheers

Jules