Let's develop an open-source media archive standard

From: Kevin Handy <kth_at_srv.net>
Date: Wed Aug 11 11:09:43 2004

Jules Richardson wrote:

>On Wed, 2004-08-11 at 13:13, Steve Thatcher wrote:
>
>> I would encode binary data as hex to keep everything ASCII. Data size would
>>expand, but the data would also be compressible, so things could be kept in
>>ZIP files or whatever format a person wants to use for archiving purposes.
>
>"could be kept" in zip files, yes - but then that's no use in 50 years
>time if someone stumbles across a compressed file and has no idea how to
>decompress it in order to read it and see what it is :-)
>Hence keeping the archive small would seem sensible so that it can be
>left as-is without any compression. My wild guesstimate on archive size
>would be to aim for 110 - 120% of the raw data size if possible...
>
I'd expect that ZIP archives will be well known in the future, due to
the large number of software products using the format right now, and the
length of time it has existed so far. (Famous last words?)

The compression methods are fairly well known, and several open-source
libraries exist to handle them; support is also embedded in many new
projects.

OpenOffice/StarOffice, as an example, uses zipped XML files as its
native save format. This fact alone should help keep the ZIP format
alive, given the large number of groups/companies/governments/countries
switching to OpenOffice.

>>XML has become a storage format of choice for a lot of different commercial
>>packages. My knowledge is more based on the Windows world, but I would doubt
>>that other software houses are avoiding XML. Sun/Java certainly
>>embraces it.
>
>My background (in XML terms) is with Java - but I've not come across
>software that mixes a human-readable format in with a large amount of
>binary data (whether encoded or not). Typically the metadata's
>kept separate from the binary data itself, either in parallel files (not
>suitable in our case) or as separate sections within the same file.
>
Binary data is usually encoded in an XML file as some type of text:
base64, hex/octal, etc. You don't want actual raw binary data embedded
in an XML document.
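
For instance, a quick sketch in Perl using the standard MIME::Base64
module (the <sector> element name here is just made up for illustration):

    use MIME::Base64 qw(encode_base64);

    my $raw = "\x00\x01\xFE\xFF" x 32;   # stand-in for 128 bytes of sector data

    # Encode to base64 with no embedded newlines, then wrap it in XML
    my $b64 = encode_base64($raw, "");
    print qq{<sector number="0" encoding="base64">$b64</sector>\n};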

>>I don't quite understand why representing binary as hex would affect the
>>ability to have command line utilities.
>
>See my posting the other week when I was trying to convert ASCII-based
>hex data back into binary on a Unix platform :-) There's no *standard*
>utility to do it (which means there certainly isn't on Windows). If the
>data within the file is raw binary, then it's just a case of using dd to
>extract it even if there's no high-level utility available to do it.
>
But how do you know that dd will exist in 50 years?

It isn't hard to write a simple hex-to-bin or bin-to-hex program.
A couple of lines of Perl, and you are done. The main problem is
there isn't a real hex-dump standard. There are several hex
conversions in 'recode' on Linux.
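
To illustrate the Perl version - a rough, untested sketch of both
directions (reads stdin, writes stdout):

    # bin2hex: raw bytes in, ASCII hex out, 16 bytes per line
    binmode(STDIN);
    $/ = \16;                          # read fixed 16-byte records
    while (my $chunk = <STDIN>) {
        print unpack("H*", $chunk), "\n";
    }

and the other way:

    # hex2bin: ASCII hex in, raw bytes out
    binmode(STDOUT);
    while (my $line = <STDIN>) {
        $line =~ s/\s+//g;             # strip whitespace and line breaks
        print pack("H*", $line);
    }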

>>Certainly more CPU cycles are needed for conversion and the image file size
>>is larger, but we need a readable format
>
>But the data describing all aspects of the disk image would be readable by a
>human; it's only the raw data itself that wouldn't be - for both
>efficiency and for ease of use. The driving force for having
>human-readable data in the archive is so that it can be reconstructed at
>a later date, possibly without any reference to any spec, is it not? If
>it was guaranteed that a spec was *always* going to be available, having
>human-readable data at all wouldn't make much sense as it just
>introduces bloat; a pure binary format would be better.
>
>I'm not quite sure what having binary data represented as hex for the
>original disk data gives you over having the raw binary data itself -
>all it seems to do is make the resultant file bigger and add an extra
>conversion step into the decode process.
>
Binary data can get badly munged during transmission across the internet,
with ASCII-EBCDIC conversions, 7-bit/8-bit paths, etc. That's not as much of
a problem as it used to be. A text version doesn't suffer from these problems.

>>and I would think that CPU cycles are not as much of a concern, nor is file size.
>
>In terms of CPU cycles, for me, no - I can't see myself ever using the
>archive format except on modern hardware. I can't speak for others on
>the list though.
>
>As for file size, encoding as hex at least doubles the size of
>your archive file compared to the original media (whatever it may be).
>That's assuming no padding between hex characters. Seems like a big
>waste to me :-(
>
If you then ZIP the file, it will likely become smaller than the original.
With the availability of programs like zcat/zmore, reading it wouldn't be
much of a problem either.

>>The only difference I see in the sections that were described is that
>>the first one encompasses the format info and the data. My description
>>had the first one as being a big block that contained the two other sections
>>as well as length and CRC info to verify data consistency.
>> Adding author, etc. to the big block would make perfect sense.
>
>Yep, I'm with you there. CRCs are a nice idea. Question: does it make
>sense to make CRC info a compulsory section in the archive file? Does it
>make sense to have it always present, given that it's *likely* that
>these archive files will only ever be transferred from place to place
>using modern hardware? I'm not sure. If you're spitting data across a
>buggy serial link, then the CRC info is nice to have - but maybe it
>should be an optional inclusion rather than mandatory, so that in a lot
>of cases archive size can be kept down? (and the assumption made that
>there exists source code / spec for a utility to add CRC info to an
>existing archive file if desired)
>
The necessity of CRCs depends on what you plan on doing with the data.
If it is just going to be sitting in a nice, safe box, then it doesn't
matter.
If it is going to be tossed all over the place, through strange networks,
around the world, bounced off Mars, etc., then it becomes much more
important.

Depending on how large a chunk you CRC, the size differential can be
very minimal. A CRC for every byte is expensive; one per track is minimal.
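
To put rough numbers on it: an 8-hex-digit CRC-32 per 256-byte sector is
around 3% overhead on the raw data; one per 4KB track is around 0.2%. And
CRC-32 is simple enough to re-implement from the spec if no library
survives - a sketch in Perl (standard IEEE polynomial, no modules needed):

    # Build the table-driven CRC-32 lookup table (polynomial 0xEDB88320)
    my @table;
    for my $i (0 .. 255) {
        my $c = $i;
        for (1 .. 8) {
            $c = ($c & 1) ? (0xEDB88320 ^ ($c >> 1)) : ($c >> 1);
        }
        $table[$i] = $c;
    }

    sub crc32 {
        my ($data) = @_;
        my $crc = 0xFFFFFFFF;
        for my $byte (unpack("C*", $data)) {
            $crc = $table[($crc ^ $byte) & 0xFF] ^ ($crc >> 8);
        }
        return $crc ^ 0xFFFFFFFF;
    }

    printf "%08x\n", crc32("one track of raw sector data");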