Lossy compression vs. archiving and OCR (was Re: Many things)

From: Eric Smith <eric_at_brouhaha.com>
Date: Mon Jan 31 15:15:16 2005

Jim Battle wrote:
> When you
> scan to bilevel, exactly where an edge crosses the threshold depends
> on the exact placement of the page, on the scanner's threshold setting,
> and probably on the phase of the 60 Hz AC, since it couples to some
> degree into the lamp brightness (hopefully not much at all, but if you
> are splitting hairs...). Thus there is no "perfect" scan.

Never claimed there was. But I don't want software to DELIBERATELY
muck about with the image, replacing one glyph with another. That's
potentially MUCH WORSE than any effect you're going to get from the
page being shifted or skewed a tiny amount.
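To be clear, the threshold effect is real. A toy sketch in Python
(made-up pixel values and a fixed threshold of 128, nothing to do with
any particular scanner) shows how a tiny brightness wobble moves a
bilevel edge by a pixel:

    scanline = [30, 60, 90, 120, 150, 180, 210]  # hypothetical gray values across an edge
    threshold = 128

    def binarize(pixels, wobble=0):
        # Add a small brightness offset (e.g. lamp flicker), then threshold to bilevel.
        return [1 if p + wobble > threshold else 0 for p in pixels]

    print(binarize(scanline))             # [0, 0, 0, 0, 1, 1, 1]
    print(binarize(scanline, wobble=10))  # [0, 0, 0, 1, 1, 1, 1]  edge moved one pixel

That kind of one-pixel jitter is the sort of variation I can live with;
substituting whole glyphs is not.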

> If you are scanning at such a low resolution that two "e"s from
> different fonts might get confused with each other, your OCR attempts
> will be hopeless as well.

But that's what you yourself said the DjVu software does. It
replaces glyphs with other glyphs that it thinks are similar. No matter
how good a job it thinks it can do of that, I DO NOT WANT IT FOR
ARCHIVAL DOCUMENTS.
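To make the objection concrete, here is a toy sketch in Python
(emphatically NOT DjVu's actual JB2 coder; the bitmaps and the match
threshold are made up) of why any such substitution is lossy:

    def hamming(a, b):
        # Number of pixels where two flattened glyph bitmaps disagree.
        return sum(x != y for x, y in zip(a, b))

    # Two hypothetical 5x5 glyph bitmaps that differ in only two pixels,
    # yet which a careful human reader might identify as different characters.
    glyph_a = [0,1,1,1,0, 1,0,0,0,1, 1,1,1,1,1, 1,0,0,0,0, 0,1,1,1,0]
    glyph_b = [0,1,1,1,0, 1,0,0,0,1, 1,1,1,0,0, 1,0,0,0,0, 0,1,1,1,0]

    MATCH_THRESHOLD = 3   # assumed "close enough" tolerance
    dictionary = []       # shared shape dictionary built as the page is encoded

    def encode(glyph):
        # Return the dictionary shape that will be drawn in place of this glyph.
        for rep in dictionary:
            if hamming(rep, glyph) <= MATCH_THRESHOLD:
                return rep          # substitute: the original shape is thrown away
        dictionary.append(glyph)
        return glyph

    encode(glyph_a)
    print(encode(glyph_b) is dictionary[0])   # True: glyph_b is rendered as glyph_a

Once that substitution happens, no amount of later study by a human or
by better OCR can recover what was really on the page.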

I normally scan at 300 or 400 DPI; when there is very tiny text I
sometimes use 600 DPI.

Even at those resolutions, it can be difficult to tell some characters
apart, especially from poor-quality originals. But usually I can do
it if I study the scanned page very closely. No, OCR today cannot do
as good a job at that as I can. Someday OCR may be better. But
arbitrarily replacing the glyphs with other ones the software considers
"good enough" is going to f*&# up any possibility of doing this by
either a human OR OCR.

And all to make the file a little smaller. A DVD-R costs about $0.25
and stores 4.7GB of data, so I just can't get excited about using lossy
encoding for text and line-art pages that rarely compress to more than
50K bytes per page with lossless G4.
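To put rough numbers on it (using the ~50K bytes per G4 page and $0.25
per 4.7GB disc above):

    dvd_cost_usd = 0.25
    dvd_capacity = 4.7e9          # bytes
    page_size    = 50_000         # bytes, upper bound for a lossless G4 page

    pages_per_disc = dvd_capacity / page_size         # ~94,000 pages
    cost_per_page  = dvd_cost_usd / pages_per_disc    # ~$0.0000027 per page

    print(f"{pages_per_disc:,.0f} pages per disc, ${cost_per_page:.7f} per page")

Even if lossy encoding cut the page size in half, the savings would be
a fraction of a thousandth of a cent per page.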

Eric