archiving as opposed to backing up

From: Tom Jennings <>
Date: Mon Sep 20 19:55:46 2004

On Mon, 2004-09-20 at 15:59, Teo Zenios wrote:

> Tape is still the best method for archiving files, if its not worth
> the time
> to backup a file to DAT then the file is not worth backing up (keeps
> you
> from archiving things you will never need). I use cdrs for music and
> copying
> other cd's.

Though this is close to OT (twice in one day!), I'd like to comment on
the concept of "backup"; though it usually takes place on new gear, it
totally pertains to old data, and is why we have so little of it!

In my substantial experience, "offline" backup systems are totally, 100%
useless for archival purposes. No machine-readable format has yet been
devised that can be relied upon to persist, period. It's a certainty
that none of the crap we buy today will be, either. None.

Exceptional feats, like recovering data from 20-year-old tapes, mean
nothing more than what they are: exceptions to the world-wide,
near-total loss of magnetic media to bit rot. For every one of those,
there are 1,000, 100,000 unreadable, lost, mislabelled tapes containing
data in some inscrutable proprietary format.

Pressed CDs might last a truly long time, but unless there's a
widespread and earnest will to keep producing readers for them (yeah,
right) over the long haul, as there is for fiche/film, they'll go the
way of 800 bpi 1/2" tape and 3740 floppies.

So far, our only long-term (100-year-plus) options are paper and
fiche/film, which ironically makes one of the oldest discrete-symbol
media, paper tape, also one of the longest-lived (and one of the
lowest-density!).

Practically speaking, I use an idea of John Gilmore's, which is brutally
simple: rotating spindles. Not only is it the most reliable system (not
component), its obviously ephemeral nature is a CONSTANT DAILY REMINDER
of how little stands between our life's work and oblivion, and how much
attention data retention requires.

* Like many people, I have machines (owned or shared) with 100% uptime,
at home and on the net far away (three cities > 1000 mi apart).

* Each computer has a disk drive for data. When that disk throws errors,
makes Bad Noises, or Inappropriate Heat, Etc, it is replaced with great
fanfare, ASAP. Occasionally there is no data loss.

* Every machine-readable datum I have ever worked on resides on my
laptop (not including most of my not-very-substantial music collection,
though the same applies there). All email sent and received back to 1994
(with gaps), etc.

* Using appropriate software (scripts involving rsync) every single
computer has a copy of every single file, updated daily.

Note that this scheme automagically takes care of changes and occasional
improvements in technology and is largely independent of the technology.
It's not archival though, just longish-lived.
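The daily rsync step described above can be sketched as a one-liner
wrapped in a shell function. This is a minimal illustration, not the
author's actual script; the flag choices and the idea of pointing it at
a remote host are my assumptions.

```shell
#!/bin/sh
# Sketch of the "every computer has a copy of every file" step.
# mirror_dir SRC DST makes DST an exact copy of SRC. In real use DST
# would typically be remote, e.g. "user@far-host:/mirror/data"
# (hypothetical name), run daily from cron.
mirror_dir() {
    # -a: archive mode (recurse; preserve mtimes, permissions, symlinks)
    # --delete: remove files at DST that no longer exist at SRC,
    #           so DST is a true mirror rather than an accretion
    rsync -a --delete "$1"/ "$2"/
}
```

The trailing slashes matter to rsync: `SRC/` means "the contents of
SRC", so the destination tree mirrors the source tree rather than
nesting a copy of SRC inside it.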

I supplement this with burned ISO-9660 CDROMs that I mail out in
bundles to friends in yet other cities ("just toss it up in a closet").
That's just a little added paranoia, but I don't expect those copies to
be useful beyond the next year or two. (The last dump though, a gzipped
tar of only crucial, hand-selected "core" data, still took 8 CDRs, so I
will probably stop this practice.)

Wildly multiple copies work (or worked) for printed books, sort of, but
with the explosion of digital data it's too hard to tell wheat from
chaff. At this point there really needs to be a *cultural* will to
preserve (librarians are our heroes!). There was some U.S. gov't effort
in the past (the ruthless profiteers who run the U.S. now pretty much
rule that out, regardless of which party "wins"), but at this point we
all have to play Survivalist and make paper copies.

I'm seriously thinking of printing on paper my entire website.

An aside, but strongly related, is this illuminating (to me) tidbit:
OK, so my website has 1100+ hand-edited flat HTML files (in vi) and
another 3000 or so generated with a perl script. Every few years I ask
acknowledged web content experts how I could manage this stuff more
algorithmically.

With no exception so far, all of the web experts give me sales spiels --
I talk about data, they talk about programs. Nothing I have heard or
read about will handle importing existing flat-file data; worse, they
say silly things like "remove the HTML header data and import each
page"! Not only is that mechanically impractical, the headers contain
hand-crafted data!

There's no talk of: what happens to my data when Program X is no longer
supported? How do I get my data out? What's the internal data
representation, and how are relationships maintained? Ad nauseam.

Flat files it will remain. Scripts easily export, verify, import data.
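A verify pass over flat files really is a few lines of script. The
check below (every page must carry a <title> tag) is purely an
illustrative assumption, not the author's actual script, but it shows
the shape of the thing: no program lock-in, just text and standard
tools.

```shell
#!/bin/sh
# Toy verification pass over a directory of flat HTML files.
# verify_html DIR prints each file missing a <title> tag and returns
# nonzero if any file fails; returns zero when all pass.
verify_html() {
    fail=0
    for f in "$1"/*.html; do
        [ -e "$f" ] || continue          # handle the no-match glob case
        if ! grep -qi '<title>' "$f"; then
            echo "missing <title>: $f"
            fail=1
        fi
    done
    return $fail
}
```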

There's little motion I see towards actually SAVING things by anyone
except librarians, and no one listens to them or gives them any money.
Received on Mon Sep 20 2004 - 19:55:46 BST
