Let's develop an open-source media archive standard

From: Patrick Finnegan <pat_at_computer-refuge.org>
Date: Wed Aug 11 10:57:47 2004

On Wednesday 11 August 2004 09:50, Jules Richardson wrote:
> On Wed, 2004-08-11 at 13:13, Steve Thatcher wrote:
> > I would encode binary data as hex to keep everything ASCII. Data
> > size would expand, but the data would also be compressible, so
> > things could be kept in ZIP files or whatever a person would want
> > to use for their archiving purposes.
>
> "could be kept" in zip files, yes - but then that's no use in 50
> years time if someone stumbles across a compressed file and has no
> idea how to decompress it in order to read it and see what it is :-)

UNIX's compress format has been around for decades now... have its
patents expired yet? If not, there's always gzip...
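
For what it's worth, pulling data back out of a gzip'd archive takes
only a few lines in something like Python; a rough, untested sketch
(file names made up):

    import gzip

    # Round-trip a (hypothetical) archive file through gzip.
    data = open("archive.img", "rb").read()

    with gzip.open("archive.img.gz", "wb") as f:
        f.write(data)

    with gzip.open("archive.img.gz", "rb") as f:
        assert f.read() == data    # decompresses back to the original bytes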

> Hence keeping the archive small would seem sensible so that it can be
> left as-is without any compression. My wild guesstimate on archive
> size would be to aim for 110 - 120% of the raw data size if
> possible...

You could uuencode the data... For long-term archiving it might be
advisable to have plain-text paper copies of the data (hoping that in
50+ years they'll have decent OCR technology :). So you *need*
something that only uses a sane subset of characters, which is exactly
what uuencode or BASE64 encoding gives you. Uuencode is a bit older,
so I'd tend to lean towards it over BASE64.
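
Just to illustrate the idea, here's a rough Python sketch using its
base64 module (the disk image name is made up); uuencode does the same
sort of thing, just with a different character set and line framing:

    import base64

    # Turn raw sector data into printable ASCII and back, losslessly.
    raw = open("disk.img", "rb").read()
    text = base64.b64encode(raw).decode("ascii")   # ~33% larger, printable only
    assert base64.b64decode(text) == raw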

> > XML has become a storage format choice for a lot of different
> > commercial packages. My knowledge is more based on the windows
> > world, but I would doubt that other computer software houses are
> > avoiding XML. Sun/Java certainly embraces it.
>
> My background (in XML terms) is with Java - but I've not come across
> software that mixes a human-readable format in with a large amount of
> binary data (whether encoded or not). Typically the metadata's kept
> separate from the binary data itself, either in parallel files (not
> suitable in our case) or as separate sections within the same file.

I'm personally prejudiced against XML, but that's just me. : )

> > I don't quite understand why representing binary as hex would
> > affect the ability to have command line utilities.
>
> See my posting the other week when I was trying to convert
> ASCII-based hex data back into binary on a Unix platform :-) There's
> no *standard* utility to do it (which means there certainly isn't on
> Windows). If the data within the file is raw binary, then it's just a
> case of using dd to extract it even if there's no high-level utility
> available to do it.

You could decode it by hand (ick) or write a quick-and-dirty program
to do it for you. I'd hope that *programming* won't be a lost art in
50+ years.
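
Something like this rough Python sketch would do it (untested, file
names made up):

    # Quick-and-dirty: turn an ASCII hex dump back into raw binary.
    hex_text = open("image.hex").read()
    hex_text = "".join(hex_text.split())          # drop whitespace/newlines
    open("image.bin", "wb").write(bytes.fromhex(hex_text))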

> > Certainly more CPU cycles are needed for conversion and the image
> > file size is larger, but we need a readable format
>
> I'm not quite sure what having binary data represented as hex for the
> original disk data gives you over having the raw binary data itself -
> all it seems to do is make the resultant file bigger and add an extra
> conversion step into the decode process.

Again, producing paper copies of stuff with non-printable characters
becomes "problematic".

> > and I would think that CPU cycles are not as much of a concern,
> > nor is file size.
>
> In terms of CPU cycles, for me, no - I can't see myself ever using
> the archive format except on modern hardware. I can't speak for
> others on the list though.
>
> As for file size, encoding as hex at least doubles the size of your
> archive file compared to the original media (whatever it may be).
> That's assuming no padding between hex characters. Seems like a big
> waste to me :-(

Then use uuencode or something similar that does a less wasteful
conversion. Anyway, the only computer media type where KB/in^2 (or
KB/in^3) isn't increasing rapidly is paper.
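
Back-of-the-envelope, for a hypothetical 360 KB floppy image:

    # Rough size overhead of the plain-text encodings being discussed.
    raw = 360 * 1024
    print("raw     :", raw)
    print("hex     :", raw * 2)           # 2 chars per byte          -> +100%
    print("base64  :", raw * 4 // 3)      # 4 chars per 3 bytes       -> +33%
    print("uuencode:", raw * 62 // 45)    # 62-char line per 45 bytes -> ~38%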

> > The only difference I see in the sections that were described is
> > that the first one encompasses the format info and the data. My
> > description had the first one as being a big block that contained
> > the two other sections as well as length and CRC info to verify
> > data consistency. Adding author, etc to the big block would make
> > perfect sense.
>
> Yep, I'm with you there. CRCs are a nice idea. Question: does it
> make sense to make CRC info a compulsory section in the archive file?
> Does it make sense to have it always present, given that it's
> *likely* that these archive files will only ever be transferred from
> place to place using modern hardware? I'm not sure. If you're
> spitting data across a buggy serial link, then the CRC info is nice
> to have - but maybe it should be an optional inclusion rather than
> mandatory, so that in a lot of cases archive size can be kept down?
> (and the assumption made that there exists source code / spec for a
> utility to add CRC info to an existing archive file if desired)

Yes. A CRC is *always* a good idea. Or you could even use ECC ;)
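
Computing one costs next to nothing with any modern language; a rough
sketch using Python's zlib.crc32 (the section file name is made up):

    import zlib

    # CRC-32 of one data section, printed as 8 hex digits for storage.
    section = open("track00.bin", "rb").read()
    crc = zlib.crc32(section) & 0xffffffff       # mask keeps the value portable
    print("%08X" % crc)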

I don't really understand why you're quite so concerned about archive
size bloat, at least over things like CRCs (which, even applied
liberally, might add 4% to the size) or plain-text encoding (which
would add between 33% and about 100% to the size). I'd rather give up
some efficiency in this case to ensure that the data is stored
correctly and can be properly read (and easily decoded) in 50+ years.

Pat
-- 
Purdue University ITAP/RCS        ---  http://www.itap.purdue.edu/rcs/
The Computer Refuge               ---  http://computer-refuge.org