OCR'ing old manuals

From: Eric Smith <eric_at_brouhaha.com>
Date: Sat Sep 13 11:58:00 2003

"Antonio Carlini" <arcarlini_at_iee.org> wrote:
> As long as you scan the stuff now while you have it, you can OCR at your
> leisure when the technology improves (and requires far less
> proof-reading).

Note that you should NEVER save scans of text and line art in a lossy
form such as JPEG. JPEG works for continuous-tone images such as
photographs by deliberately throwing away high-frequency components.
Text and line art contain sharp black-to-white transitions (and vice
versa, of course) which get smeared by this compression, resulting in
a blurry image.
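
A minimal sketch of the smearing, assuming a single row of 8 samples and
a plain 1-D DCT as a stand-in for JPEG's 8x8 block transform. Real JPEG
quantizes coefficients rather than zeroing them outright, so the function
names and the hard cutoff here are illustrative assumptions only.

```python
import math

N = 8  # JPEG works on 8x8 blocks; one row of 8 samples shows the effect

def dct(samples):
    """Orthonormal DCT-II of a list of N samples."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * sum(s * math.cos(math.pi * (n + 0.5) * k / N)
                               for n, s in enumerate(samples)))
    return out

def idct(coeffs):
    """Inverse of dct() (DCT-III with matching scaling)."""
    out = []
    for n in range(N):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
            total += scale * c * math.cos(math.pi * (n + 0.5) * k / N)
        out.append(total)
    return out

def lowpass(samples, keep):
    """Zero every DCT coefficient except the `keep` lowest frequencies."""
    coeffs = dct(samples)
    coeffs = coeffs[:keep] + [0.0] * (N - keep)
    return [round(v) for v in idct(coeffs)]

edge = [0, 0, 0, 0, 255, 255, 255, 255]  # one sharp black-to-white transition
print(lowpass(edge, N))  # every coefficient kept: the edge survives
print(lowpass(edge, 3))  # high frequencies dropped: the edge turns into grays
```

With all coefficients kept the row comes back unchanged; with only the
lowest three, the samples near the transition take intermediate gray
values, which is the blur described above.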

For text and line art, use a lossless bilevel compression such as G3 or
G4 fax format (used in some TIFF files), JBIG, JBIG2, or Flate (used in
some PNG files). You can't assume that saving in TIFF or PNG gets you a
specific form of compression, since they are very broad standards that
support multiple compression types.
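
One way to check what a TIFF file actually contains is to read its
Compression tag (tag 259) from the first image directory. The following
is a rough sketch using only the Python standard library; the function
name and the in-memory demo header are assumptions for illustration, and
it handles only classic (non-BigTIFF) files.

```python
import struct

# Values of the TIFF Compression tag (259) relevant to scanned pages.
COMPRESSION_NAMES = {
    1: "uncompressed",
    2: "CCITT modified Huffman RLE",
    3: "CCITT Group 3 fax",
    4: "CCITT Group 4 fax",
    5: "LZW",
    6: "JPEG (old style)",
    7: "JPEG",
    8: "Deflate (Flate)",
    32773: "PackBits",
    34661: "JBIG",
}

def tiff_compression(data):
    """Return the Compression tag value from the first IFD of a TIFF image."""
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i   # each directory entry is 12 bytes
        tag, field_type, n = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag == 259:
            # Compression is a SHORT held in the first two bytes of the value field
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
            return value
    return 1  # the TIFF default when the tag is absent is no compression

# Hypothetical in-memory header with a single entry: Compression = 4 (G4)
demo = (struct.pack("<2sHI", b"II", 42, 8)
        + struct.pack("<H", 1)
        + struct.pack("<HHIHH", 259, 3, 1, 4, 0)
        + struct.pack("<I", 0))
code = tiff_compression(demo)
print(COMPRESSION_NAMES.get(code, code))
```

For a real scan, pass the file's bytes (for example the result of reading
the file in binary mode) instead of the demo header.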

Sometimes people tell me that JPEG is all right if you only compress
slightly. The edges still get blurry, and the resulting file size
is generally *MUCH* larger than if you use G4 or JBIG.

JBIG gets 10-20% better compression than G4, but it is patented, so I
don't use it. Flate usually compresses somewhat better than G4, and
is not patented, but I'm not sure how it compares to JBIG. I'm not
using it because support is not widespread yet. G4 works well because
it can be wrapped in PDF and used by any PDF viewer.

Bilevel compression doesn't work well on continuous tone images, so
JPEG should be used for those.

The main dilemma for scanning is pages that contain a mix of text/
line art and continuous tone images. My personal reccomendation for
these is to either scan the page twice, in B&W and color (or gray scale),
or to scan it once to uncompressed color (or gray scale) then convert to
both bilevel and JPEG in software.
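
The bilevel half of that software conversion can be sketched as a simple
threshold. This assumes 8-bit gray scale pixels in row-major order and a
fixed threshold of 128; both the function name and the threshold are
illustrative assumptions, and a real scan may need a tuned or adaptive
threshold. The output is a raw PBM (P4) image, which other tools can then
compress with G4. The JPEG half would come from an image library and is
not shown.

```python
def gray_to_pbm(width, height, pixels, threshold=128):
    """Pack 8-bit gray pixels (row-major bytes) into a raw PBM (P4) image.

    Pixels darker than `threshold` become black. In PBM a 1 bit is black,
    bits are packed most significant first, and each row is padded to a
    whole number of bytes.
    """
    row_bytes = (width + 7) // 8
    out = bytearray(b"P4\n%d %d\n" % (width, height))
    for y in range(height):
        row = bytearray(row_bytes)
        for x in range(width):
            if pixels[y * width + x] < threshold:
                row[x // 8] |= 0x80 >> (x % 8)
        out += row
    return bytes(out)

# One 10-pixel row: five black pixels followed by five white ones
sample = bytes([0] * 5 + [255] * 5)
pbm = gray_to_pbm(10, 1, sample)
```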

Apparently some "best practice" policies for document archiving
specifically state that a document should only be scanned once. I
think they're just trying to minimize handling of fragile documents,
so I don't think they really mean that taking two scans of a page
(consecutively without manipulating the physical document) is bad.
These "best practice" policies also recommend a minimum of 600 DPI,
which is reasonable for continuous tone images but is normally
overkill for text and line art. I typically use 300 or 400 DPI.
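
To put rough numbers on the resolution choice, here is a small sketch of
the uncompressed size of one page at each resolution. The 8.5 x 11 inch
page size and the function name are assumptions for illustration; the
compressed sizes will be much smaller and depend on the page content.

```python
def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size of one scanned page, with rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px

for dpi in (300, 400, 600):
    bilevel = raw_page_bytes(dpi, 1)   # 1 bit per pixel
    gray = raw_page_bytes(dpi, 8)      # 8 bits per pixel
    print(f"{dpi} DPI: bilevel {bilevel / 1e6:.1f} MB, gray {gray / 1e6:.1f} MB")
```

Doubling the resolution from 300 to 600 DPI quadruples the number of
pixels, so the raw data is about four times larger before compression.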

I've written a program to take B&W TIFF files and color or B&W JPEG
files and produce a PDF file:
    http://tumble.brouhaha.com/

My future plans for tumble include compositing text and line art with
continuous tone images on a single page. I've got a script for GIMP
to take an uncompressed color or gray scale scan (in PGM or PPM format),
allow manual selection of the continuous tone images, then save two
separate files. I've been thinking about trying to automate this by
having the filter use histograms and FFT to locate the images. I'm not
sure when I'll have time to work on this further, though.
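
As a purely hypothetical sketch of the histogram idea (not the GIMP
script or any part of tumble): text and line art are almost entirely
near-black or near-white, while continuous tone regions have many
mid-gray pixels. The function names, the 64 to 192 mid-tone band, and the
0.25 cutoff below are all assumed values that would need tuning on real
scans, and the FFT part is not attempted here.

```python
def midtone_fraction(tile, low=64, high=192):
    """Fraction of 8-bit gray pixels that are neither near-black nor near-white."""
    if not tile:
        return 0.0
    mid = sum(1 for p in tile if low <= p < high)
    return mid / len(tile)

def looks_continuous_tone(tile, cutoff=0.25):
    """Guess whether a tile of pixels belongs to a continuous tone image."""
    return midtone_fraction(tile) > cutoff

text_tile = [0] * 10 + [255] * 90   # mostly white paper with some black ink
photo_tile = list(range(256))       # smooth ramp through every gray level
print(looks_continuous_tone(text_tile), looks_continuous_tone(photo_tile))
```

A page would be split into tiles, each tile classified this way, and
adjacent continuous tone tiles merged into regions to be saved as JPEG.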

Eric
Received on Sat Sep 13 2003 - 11:58:00 BST
