TurboDOS

From: Eric Smith <eric_at_brouhaha.com>
Date: Sat Oct 24 23:58:03 1998

John Lawson <jpl15_at_netcom.com> wrote:
> I have tried scanning manuals a few times, sans much luck. Perhaps
> you could e-mail me privately with some knowledge on getting text
> scanned properly... or maybe my sorry software is braindead in the
> text dept. I have tried making jpegs, but by the time they're
> composited, tweaked, and compressed, they're mostly illegible. :(

Since I'd like to share my philosophy of printed document preparation
with a larger audience, I'm not emailing it privately.

The secret (IMNSHO) is to recognize four things:

1) *NONE* of the available OCR software is anywhere near good enough.

2) This lifetime is too short to manually fix up the output of the OCR
    process.

3) Despite #1 and #2, it still is worthwhile making scanned images
    available in some form.

4) Even if OCR isn't good enough for document preservation, it's still
    worthwhile as a supplement since you can't grep images.

Once you've resigned yourself to that, the solution is to scan the pages
at a reasonable resolution (typically 300 DPI), save them as TIFF Class F
files using ITU-T Group 4 (T.6) compression, and run them through
Adobe Acrobat Exchange's "Capture" module in "invisible text" mode.
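
To put the Group 4 choice in perspective, here is a rough sketch of how
big one uncompressed bilevel page is at 300 DPI. The US Letter page size
(8.5 x 11 inches), the 1-bit-per-pixel depth, and the function name are
assumptions for illustration, not measurements of the actual scans.

```python
def raw_bilevel_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Bytes for an uncompressed 1-bit-per-pixel scan, packed with no row padding."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # Assumed page size: US Letter, 8.5 x 11 inches.
    size = raw_bilevel_bytes(8.5, 11, 300)
    print(f"{size:,} bytes per raw page")
```

Under those assumptions a raw page is roughly a megabyte, so a finished
file that averages well under 100,000 bytes per page is more than ten
times smaller than the uncompressed bitmap.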

The Capture module will OCR the text to the best of its ability, but
it saves the entire scanned image in the PDF file, so the document
can be displayed or printed in all its original glory (and with all the
original coffee stains, etc.). And since the Capture module does a fair
job of OCR and stores the recognized text in the PDF file with the
"invisible" attribute, the reader can still use the search capabilities.

The resulting document sizes are of course somewhat large, but not
so huge as to be completely unmanageable.

As an example, I have two DECsystem-10 manuals and a portion of a third
currently available from one of my web sites:

        http://www.36bit.org/dec/manual/

Printed Pages   Document Size   Average Bytes Per Page
-------------   -------------   ----------------------
     162           11.9 M               76,959
     514           36.2 M               73,935
      50            2.5 M               53,375
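
For anyone who wants to check the last column, the arithmetic can be
sketched as below. It assumes "M" means 2**20 bytes, which is a guess
(it fits the averages better than 10**6 does). Because the sizes are
rounded to one decimal place, the recomputed figures only approximate
the ones in the table.

```python
MIB = 2 ** 20  # assumption: "M" in the table means 2**20 bytes


def avg_bytes_per_page(size_m: float, pages: int) -> float:
    """Average file bytes per printed page for a document of size_m 'M'."""
    return size_m * MIB / pages


# (printed pages, document size in M), copied from the table
manuals = [(162, 11.9), (514, 36.2), (50, 2.5)]

if __name__ == "__main__":
    for pages, size_m in manuals:
        avg = avg_bytes_per_page(size_m, pages)
        print(f"{pages:4d} pages  {size_m:5.1f} M  ~{avg:,.0f} bytes/page")
```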

A 36-megabyte file admittedly takes a fair bit of time to download over
a modem link. But the alternative was for people not to be able to get
it from me at all: I don't mind spending hours to scan a manual once,
but I'm not willing to spend hours making a photocopy EVERY TIME someone
wants a copy.
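
As a back-of-the-envelope sketch of "a fair bit of time": the modem
speeds below are hypothetical examples, "M" is assumed to mean 2**20
bytes, and the calculation ignores protocol overhead, modem compression,
and line noise, so treat the results as a rough lower bound.

```python
MIB = 2 ** 20  # assumption: "M" means 2**20 bytes


def download_seconds(size_bytes: float, bits_per_second: float) -> float:
    """Idealized transfer time with no overhead or compression."""
    return size_bytes * 8 / bits_per_second


if __name__ == "__main__":
    size = 36.2 * MIB  # the largest manual in the table
    for bps in (28_800, 33_600, 56_000):  # hypothetical modem speeds
        hours = download_seconds(size, bps) / 3600
        print(f"{bps:>6} bit/s: about {hours:.1f} hours")
```

At those assumed speeds the largest manual works out to somewhere
between about one and a half and three hours of connect time.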

My web server supports byte-serving, so people running Netscape or IE
with the Acrobat plug-in can browse the documents without having to
download them in their entirety.
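
Byte-serving corresponds to what HTTP calls range requests: the client
asks for a slice of the file instead of the whole thing. A minimal
sketch of building such a request follows. The URL and the function
name are hypothetical, and the request is only constructed, not sent.

```python
import urllib.request


def build_range_request(url: str, start: int, length: int) -> urllib.request.Request:
    """Build (but do not send) a GET for `length` bytes starting at offset `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    req = build_range_request("http://example.com/manual.pdf", 0, 1024)
    print(req.get_header("Range"))
```

A server that honors the header answers with status 206 (Partial
Content) and only the requested bytes; one that does not simply returns
the whole file with status 200.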

So far the only people who have complained about it are people who had
no reason to need the files anyhow. I don't really understand it, but a
lot of people seem to download everything they can get their hands on for
no particular reason. I used to have some very large PostScript documents
on my web site, each available either ZIP'd or tar'd, and I specifically
stated on the web page that the contents were the same, so please don't
download both. An amazing number of people downloaded both anyhow. So now
I only provide things like that as tar files.

Eric