Let's develop an open-source media archive standard

From: Vintage Computer Festival <vcf_at_siconic.com>
Date: Wed Aug 11 13:15:18 2004

On Wed, 11 Aug 2004, Jules Richardson wrote:

> > I would encode binary data as hex to keep everything ASCII. Data size would expand,
> > but the data would also be compressible so things could be kept in ZIP files or
> > whatever format a person would want to use for their archiving purposes.
>
> "could be kept" in zip files, yes - but then that's no use in 50 years
> time if someone stumbles across a compressed file and has no idea how to
> decompress it in order to read it and see what it is :-)
> Hence keeping the archive small would seem sensible so that it can be
> left as-is without any compression. My wild guesstimate on archive size
> would be to aim for 110 - 120% of the raw data size if possible...

I agree, but we'd definitely have to include compression features if we
are to meet this goal. Using a floppy disk as an example, a worst case
scenario is that the image would be maybe 205% the size of the original
media (200% because you are now using two bytes to store one, and 5% for
all the markup tags).
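
To make that arithmetic concrete, here is a minimal Python sketch. The
1,474,560-byte image (a 1.44 MB floppy) and the flat 5% markup allowance are
assumptions for illustration only; nothing here is part of an agreed spec.

```python
import binascii

RAW_SIZE = 1_474_560      # assumed: bytes on a 1.44 MB floppy
MARKUP_FRACTION = 0.05    # assumed: markup tags cost about 5% of the raw size


def hex_archive_estimate(raw_size, markup_fraction=MARKUP_FRACTION):
    """Estimate archive size when every raw byte becomes two hex characters."""
    hex_bytes = raw_size * 2                          # the "200%" part
    markup_bytes = round(raw_size * markup_fraction)  # the "5%" part
    return hex_bytes + markup_bytes


# Sanity check: hex encoding really does double the byte count.
sample = bytes(range(256))
assert len(binascii.hexlify(sample)) == 2 * len(sample)

total = hex_archive_estimate(RAW_SIZE)
print(f"{total} bytes, about {100 * total / RAW_SIZE:.0f}% of the raw size")
```

Any padding between hex characters (spaces, line breaks) would push the
figure above this estimate.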

Keeping the archive small should be a major goal, since that would
encourage people to keep the images stored uncompressed. Hard drives are
getting larger and all that, and my guess is at some point this issue will
be moot, but we can't know that for certain, so we should always assume a
worst case scenario (i.e. pessimism will be useful when designing this
specification) :)

> > Certainly more cpu cycles are needed for conversion and image file size is
> > larger, but we need a readable format
>
> But the data describing all aspects of disk image would be readable by a
> human; it's only the raw data itself that wouldn't be - for both
> efficiency and for ease of use. The driving force for having

This is a point that needs to be highlighted. These images are meant to
be human readable, first and foremost. Machine readable is a secondary
concern. We know there will definitely be humans in the future (and if
not then who cares about this anyway). There will probably be machines.
Said machines may not be useful for the task of decoding these images, so
the format must be designed with human readability in mind.

> human-readable data in the archive is so that it can be reconstructed at
> a later date, possibly without any reference to any spec, is it not? If

Indeed.

> it was guaranteed that a spec was *always* going to be available, having
> human-readable data at all wouldn't make much sense as it just
> introduces bloat; a pure binary format would be better.

Correct. So even if the spec were lost, people (who could read English at
least) would be able to figure out how to reconstruct the image from the
archive.

> I'm not quite sure what having binary data represented as hex for the
> original disk data gives you over having the raw binary data itself -
> all it seems to do is make the resultant file bigger and add an extra
> conversion step into the decode process.

But it also makes it human readable, and readable in any standard text
editor. Mixing binary data in with human readable data in a format that's
meant, first and foremost, to be human readable is antithetical to the
idea.
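
As a rough illustration of what "readable in any standard text editor" buys
us, here is a small Python sketch that writes bytes out as lines of hex and
reads them back. The function names and the 16-bytes-per-line layout are
hypothetical choices of mine, not a proposed format.

```python
def to_hex_lines(data, bytes_per_line=16):
    """Render bytes as uppercase hex pairs, a fixed number of bytes per line."""
    lines = []
    for i in range(0, len(data), bytes_per_line):
        chunk = data[i:i + bytes_per_line]
        lines.append(" ".join(f"{b:02X}" for b in chunk))
    return "\n".join(lines)


def from_hex_lines(text):
    """Recover the bytes, ignoring any spaces or line breaks between pairs."""
    return bytes.fromhex("".join(text.split()))


sector = b"\xEB\x3C\x90MSDOS5.0"   # made-up sample bytes, not a real image
text = to_hex_lines(sector)
print(text)
assert from_hex_lines(text) == sector
```

The spaces between pairs are only there for the eye and cost a third byte
per byte stored; dropping them keeps the expansion at exactly two to one.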

> As for file size, if encoding as hex that at least doubles the size of
> your archive file compared to the original media (whatever it may be).
> That's assuming no padding between hex characters. Seems like a big
> waste to me :-(

Nope. Not a waste. Essential.

> Yep, I'm with you there. CRC's are a nice idea. Question: does it make
> sense to make CRC info a compulsory section in the archive file? Does it

Yes. It's only one or two added bytes at the end of each data segment.

> make sense to have it always present, given that it's *likely* that
> these archive files will only ever be transferred from place to place
> using modern hardware? I'm not sure. If you're spitting data across a
> buggy serial link, then the CRC info is nice to have - but maybe it
> should be an optional inclusion rather than mandatory, so that in a lot
> of cases archive size can be kept down? (and the assumption made that
> there exists source code / spec for a utility to add CRC info to an
> existing archive file if desired)

It doesn't hurt. It only adds negligible overhead. Certainly something
to discuss more.

I would make the specification unassuming about anything like this. For
example, say there is an optional CRC feature. I would make the default
be that no CRC is added to the data segments, unless a meta tag in the
header explicitly specifies that CRCs are added. This makes it ever so
slightly easier to decode the image data for someone who knows nothing of
the spec. No assumptions are made regarding what people in the future
will know about these images.
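
A minimal Python sketch of that "off unless the header says so" rule
follows. Everything specific in it is an assumption for illustration: the
header is modelled as a plain dict, the "crc" key and "crc16" value are
hypothetical names, and the two-byte CRC-CCITT from Python's binascii module
is simply one convenient choice of checksum.

```python
import binascii


def encode_segment(data, with_crc=False):
    """Hex-encode a data segment, appending a 2-byte CRC only if asked to."""
    payload = data
    if with_crc:
        crc = binascii.crc_hqx(data, 0)            # 16-bit CRC-CCITT
        payload = data + crc.to_bytes(2, "big")
    return payload.hex().upper()


def decode_segment(hex_text, header):
    """Decode a segment; a CRC is only expected if the header declares one."""
    raw = bytes.fromhex(hex_text)
    if header.get("crc") != "crc16":               # default: no CRC present
        return raw
    data, stored = raw[:-2], int.from_bytes(raw[-2:], "big")
    if binascii.crc_hqx(data, 0) != stored:
        raise ValueError("CRC mismatch in data segment")
    return data


plain = encode_segment(b"HELLO")
checked = encode_segment(b"HELLO", with_crc=True)
assert decode_segment(plain, {}) == b"HELLO"
assert decode_segment(checked, {"crc": "crc16"}) == b"HELLO"
```

Someone with no spec at all can still decode the default case by turning
hex pairs back into bytes; the CRC only comes into play when the header
announces it.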

-- 
Sellam Ismail                                        Vintage Computer Festival
------------------------------------------------------------------------------
International Man of Intrigue and Danger                http://www.vintage.org
[ Old computing resources for business || Buy/Sell/Trade Vintage Computers   ]
[ and academia at www.VintageTech.com  || at http://marketplace.vintage.org  ]
Received on Wed Aug 11 2004 - 13:15:18 BST
