OCR'ing old manuals

From: Alexander Schreiber <als_at_thangorodrim.de>
Date: Sat Sep 13 13:08:01 2003

On Sat, Sep 13, 2003 at 09:50:22AM -0700, Eric Smith wrote:
> "Antonio Carlini" <arcarlini_at_iee.org> wrote:
> > As long as you scan the stuff now while you have it, you can OCR at your
> > leisure when the technology improves (and requires far less
> > proof-reading).
>
> Note that you should NEVER save scans of text and line art in a lossy
> form such as JPEG. JPEG works for continuous-tone images such as
> photographs by deliberately throwing away high-frequency components.
> Text and line art contain sharp black-to-white transitions (and vice
> versa, of course) which get smeared by this compression, resulting in
> a blurry image.

I can only second that. I've cursed time and again at some fools who
decided to scan some paper documents (fine so far) and use JPEG (lossy
compression intended for continuous-tone stuff like photo images) on
black and white scans. The results are ugly, sometimes hard to read and
a bitch to print properly. Oh, and it makes the work of OCRing them
a _lot_ harder.

> For text and line art, use a lossless bilevel compression such as G3 or
> G4 fax format (used in some TIFF files), JBIG, JBIG2, Flate (used in
> some PNG files). You can't assume that because you save in TIFF or PNG
> that you get a specific form of compression, since they are very
> broad standards that support multiple compression types.
>
> Sometimes people tell me that JPEG is alright if you only compress
> slightly. The edges still get blurry, and the resulting file size
> is generally *MUCH* larger than if you use G4 or JBIG.

Of course the files are bigger. The lossy algorithm for JPEG was
designed to work on continuous-tone images (where it works fine) and
just runs into the wall with black and white stuff. Where the algorithm
expects to find lots of low/middle frequency and some high frequency
data, it is suddenly faced with high frequency data alone. No smooth
color value curves that can be nicely compressed. Using JPEG for
compressing black and white is like using a Ferrari for pulling a
trailer full of grain - it gets the stuff moving, but you really, really
should use a proper truck for this job.
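
To make the frequency argument a bit more concrete, here is a rough
sketch in Python. It is not JPEG itself: it assumes a plain 1-D DCT over
one row of 8 pixels and simply throws away the upper half of the
coefficients, where real JPEG uses 8x8 blocks and quantization tables.
The function names and the sample values are made up for illustration.

```python
import math

def dct(samples):
    """Unnormalized DCT-II of a list of sample values."""
    n = len(samples)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(samples))
        for k in range(n)
    ]

def idct(coeffs):
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(coeffs)
    return [
        coeffs[0] / n
        + (2.0 / n) * sum(coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
                          for k in range(1, n))
        for i in range(n)
    ]

def lowpass_error(samples, keep):
    """Zero all but the first `keep` coefficients, transform back, and
    return the largest per-pixel deviation from the original row."""
    coeffs = dct(samples)
    truncated = coeffs[:keep] + [0.0] * (len(coeffs) - keep)
    restored = idct(truncated)
    return max(abs(a - b) for a, b in zip(samples, restored))

# One row of 8 pixels: a smooth gradient (photo-like) and a hard
# black-to-white edge (scanned text or line art).
ramp = [255.0 * i / 7 for i in range(8)]
edge = [0.0] * 4 + [255.0] * 4

print("ramp error:", round(lowpass_error(ramp, 4), 1))
print("edge error:", round(lowpass_error(edge, 4), 1))
```

With half of the coefficients dropped, the gradient comes back within a
few gray levels, while the pixels next to the edge are off by tens of
gray levels. That is the smearing you see around letters in JPEG scans.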

> I've written a program to take B&W TIFF files and color or B&W JPEG
> files and produce a PDF file:
> http://tumble.brouhaha.com/

Thanks for writing this program. I'm in the process of archiving the
interesting articles from a stack of computer magazines and am currently
experimenting with the best way to convert dead trees to PDF files.
So far, scanning the paper as lineart at 600 dpi, saving as fax G4
compressed TIFF and using tumble to combine those into PDF files yields
the best (best quality, smallest files) results.
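
Since TIFF allows many compression types, it can be worth checking that
a saved scan really is G4 before archiving it. A minimal sketch in
Python, assuming a classic (non-BigTIFF) file and looking only at the
first image in it; the function name and the file name in the usage note
are hypothetical:

```python
import struct

# Some values of TIFF tag 259 (Compression).
COMPRESSION_NAMES = {
    1: "none",
    2: "CCITT modified Huffman RLE",
    3: "CCITT T.4 (fax G3)",
    4: "CCITT T.6 (fax G4)",
    5: "LZW",
    6: "JPEG (old style)",
    7: "JPEG",
    8: "Deflate",
    32773: "PackBits",
}

def tiff_compression(data):
    """Return the Compression tag value of the first image directory
    in `data`, the raw bytes of a classic TIFF file."""
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i   # each directory entry is 12 bytes
        (tag,) = struct.unpack(endian + "H", data[start:start + 2])
        if tag == 259:
            # A single SHORT sits left-justified in the 4-byte value field.
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
            return value
    return 1   # tag absent: the TIFF default is no compression
```

Typical use would be to read a scan with open("page001.tif", "rb"),
pass the bytes to tiff_compression() and look the result up in
COMPRESSION_NAMES; a result of 4 means the page is stored as fax G4.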

Regards,
       Alex.
-- 
"Opportunity is missed by most people because it is dressed in overalls and
 looks like work."                                      -- Thomas A. Edison
Received on Sat Sep 13 2003 - 13:08:01 BST
