Saving old documentation

From: Eric Smith <eric_at_brouhaha.com>
Date: Wed Nov 24 16:39:42 1999

> Can anyone suggest any ways these books could be preserved (or at least,
> have their disintegration slowed down)? I'm inclined to try to scan them
> in and OCR them to preserve the information, but I believe that would
> require me to take the pages out of the binding, destroying the books
> immediately. Can anyone suggest any other preservation methods?

Personally, when things get that bad I'm inclined to remove the binding
to get the best possible scan. Of course, I would never do that to a
manual I didn't own unless I had the owner's permission.

Anyhow, I've slowly been scanning such things to put on www.36bit.org.

I use a command-line program under Linux to scan the images into .pbm
format at 300 DPI. Then I use pnmtotiff to compress them (losslessly)
into TIFF Class F Group 4 files. This type of compression is very well
suited to text and line art, but not to continuous-tone images; for those
I use JPEG. Finally, I use a program called "g4pdf" which I hacked
together based on Thomas Merz's "pdflib" and "imagepdf" to import the
images into a PDF (Acrobat) file.
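
As a rough illustration of the compression step only, here is a minimal
Python sketch that wraps pnmtotiff. It assumes the Netpbm tools are
installed and that pnmtotiff accepts the -g4 option and writes the TIFF to
standard output. The function names and the page01.pbm -> page01.tif naming
rule are hypothetical, not part of the workflow described above, and the
scanning and g4pdf steps are not shown.

```python
import subprocess
from pathlib import Path


def g4_command(pbm_path):
    """Build the pnmtotiff command line for Group 4 output."""
    # -g4 asks Netpbm's pnmtotiff for CCITT Group 4 (lossless) compression.
    return ["pnmtotiff", "-g4", str(pbm_path)]


def tiff_name(pbm_path):
    """Hypothetical naming rule: page01.pbm -> page01.tif."""
    return Path(pbm_path).with_suffix(".tif")


def convert_page(pbm_path):
    """Run pnmtotiff and save its standard output as the TIFF file."""
    out_path = tiff_name(pbm_path)
    with open(out_path, "wb") as out:
        # check=True raises CalledProcessError if pnmtotiff fails.
        subprocess.run(g4_command(pbm_path), stdout=out, check=True)
    return out_path
```

Calling convert_page("page01.pbm") would then leave page01.tif next to the
original scan, ready to be imported into a PDF.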

Once the images are in a PDF file, they can be run through Adobe Acrobat
(the full version, not the free Acrobat Reader) for OCR. This is the only
part of the process for which I use commercial software. The OCR isn't
fantastic, which is why I use a mode that preserves the full bitmap and
puts the OCR results as "hidden" text behind the image. That way you can
view the document as scanned, but still have the ability to use the
Acrobat "find" feature for text searches.

Many people say they dislike PDF, but it's the only format that both
natively supports G4 compression and has fairly widely available
viewers.

People always ask for .GIF, .JPG, or .TIFF files, but each has problems:

GIF doesn't compress scanned text nearly as well as G4.

JPG is lossy and introduces artifacts. If you keep the "quality" setting
high, it doesn't introduce as many artifacts, but it doesn't compress well
either.

TIFF Class F Group 4 would be great, but every time I've tried to distribute
files in that format, I've gotten lots of complaints about my TIFF files
being "broken", because people have TIFF viewers that don't handle that
format. (TIFF is *NOT* a single format.)
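
To make the "not a single format" point concrete: every TIFF file records
its compression scheme in a Compression field (tag 259), and a viewer has
to implement whichever scheme it finds there. The following is a minimal
Python sketch, not a full TIFF parser, that reads that field from the first
image directory. The function name is hypothetical, and the table lists
only a few of the values defined in the TIFF 6.0 specification.

```python
import struct

COMPRESSION_TAG = 259  # TIFF 6.0 "Compression" field
COMPRESSION_NAMES = {
    1: "none",
    2: "CCITT modified Huffman",
    3: "CCITT Group 3",
    4: "CCITT Group 4",
    5: "LZW",
    7: "JPEG",
    32773: "PackBits",
}


def tiff_compression(data):
    """Return the Compression value from the first IFD of TIFF bytes."""
    # The first two bytes give the byte order: II = little, MM = big endian.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(count):
        # Each directory entry is 12 bytes: tag, type, count, value.
        start = ifd_offset + 2 + 12 * i
        tag, _ftype, _n = struct.unpack(order + "HHI", data[start:start + 8])
        if tag == COMPRESSION_TAG:
            # Compression is a SHORT, left-justified in the value field.
            (value,) = struct.unpack(order + "H", data[start + 8:start + 10])
            return value
    return 1  # TIFF default when the field is absent: no compression
```

A Group 4 file yields 4 here, so COMPRESSION_NAMES.get(value, "unknown")
would show why a viewer that only handles uncompressed or LZW data reports
the file as "broken".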

I've been meaning to write a web page about this, and provide the tools
I use for Linux (and maybe a DOS or Windows port), but I haven't gotten
around to it.

You can see a few of the PDP-10 related documents I've scanned at:

        http://www.36bit.org/dec/manuals/

I've scanned two volumes of the TOPS-10 Software Notebooks since then,
but I haven't yet put them on the web page.

Eric
Received on Wed Nov 24 1999 - 16:39:42 GMT
