Lossy compression vs. archiving and OCR (was Re: Many things)

From: Jim Battle <frustum_at_pacbell.net>
Date: Mon Jan 31 17:48:25 2005

Eric Smith wrote:

> Jim Battle wrote:
>
>>When you scan to bilevel, exactly where an edge crosses the threshold
>>is subject to the exact placement of the page, to what the scanner's
>>threshold is, and probably to what phase the 60 Hz AC is in, since it
>>couples to the lamp brightness to some degree (hopefully not much at
>>all, but if you are splitting hairs...). Thus there is no "perfect"
>>scan.
>
> Never claimed there was. But I don't want software to DELIBERATELY
> muck about with the image, replacing one glyph with another. That's
> potentially MUCH WORSE than any effect you're going to get from the
> page being shifted or skewed a tiny amount.

"potentially" is the key word. if the encoding software is crappy, then
they such a substitution could turn all "e"s into "x"s. sure. but the
djvu encoder doesn't make gross substititutions like that.
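
As a sketch of how conservative that matching is -- this is my own toy
version in Python, not lizardtech's actual JB2 matcher, and the 3%
figure is invented -- a glyph only gets substituted when a stored one
is nearly pixel-for-pixel identical:

    import numpy as np

    def find_substitute(glyph, library, max_diff_frac=0.03):
        # Reuse a stored glyph bitmap only if it differs from this one
        # in fewer than max_diff_frac of its pixels; otherwise the
        # encoder falls back to coding this glyph verbatim.
        for i, candidate in enumerate(library):
            if candidate.shape != glyph.shape:
                continue
            diff = np.count_nonzero(candidate != glyph)
            if diff / glyph.size < max_diff_frac:
                return i
        return None

An "e" and an "x" differ in far more than a few percent of their
pixels, so a matcher this conservative cannot swap one for the other
wholesale; the realistic failure mode is merging two shapes that were
already nearly identical.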

Contrary to what you say, skew has a much larger effect on the sampling
than djvu's encoder does. Which scanner you use also has a much larger
effect on the sampling.
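
To make the threshold point concrete, here is a toy model (my own, in
Python; the ramp shape and the numbers are invented) of a soft ink edge
being sampled: nudge the page by half a pixel, or drift the threshold a
few percent, and a pixel at the edge flips. That is the noise floor any
"exact" bilevel scan already lives with:

    import numpy as np

    def scan_edge(shift, threshold, n=10):
        # Model a soft ink edge as a ramp from black (0.0) to white
        # (1.0), sampled at integer pixel positions; 'shift' is the
        # sub-pixel placement of the page on the glass.
        x = np.arange(n) + shift
        gray = np.clip((x - 3.0) / 4.0, 0.0, 1.0)
        return (gray < threshold).astype(int)   # 1 = black pixel

    print(scan_edge( 0.0, 0.50))  # [1 1 1 1 1 0 0 0 0 0]
    print(scan_edge(-0.4, 0.50))  # page nudged:  [1 1 1 1 1 1 0 0 0 0]
    print(scan_edge( 0.0, 0.55))  # hotter lamp:  [1 1 1 1 1 1 0 0 0 0]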

...
> I normally scan at 300 or 400 DPI; when there is very tiny text I
> sometimes use 600 DPI.
>
> Even at those resolutions, it can be difficult to tell some characters
> apart, especially from poor-quality originals. But usually I can do
> it if I study the scanned page very closely. No, OCR today cannot do
> as good a job at that as I can. Someday OCR may be better. But
> arbitrarily replacing the glyphs with other ones the software considers
> "good enough" is going to f*&# up any possibility of doing this by
> either a human OR OCR.

Eric, in picking a case where the djvu algorithm *might* cause problems,
you must also concede that bilevel scanning, even lossless, is a bad
choice for that same case. If the page is that poor, you should be
using grayscale.

Why be religious about losslessness and claim anything less is going to
"f*&#" up your efforts when the bilevel scan has already tossed away the
bulk of the information?

> And all to make the file a little smaller. DVD-R costs about $0.25
> to store 4.7GB of data, so I just can't get excited about using lossy
> encoding for text and line art pages that usually don't encode with
> lossless G4 to more than 50K bytes per page.

"A little" can be 3x. For distribution, it is a big deal. Until
recently, it made a signficant difference on disk price too, but now
that you can get 120 GB hard drives in a box of cereal, that isn't so
much of a concern.
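
To put rough numbers on "a little" (the page count here is invented;
the 50K/page figure is Eric's own):

    pages = 300                   # say, one decent-sized manual
    g4    = pages * 50 * 1024     # ~15 MB as lossless G4
    djvu  = g4 // 3               # ~5 MB at the 3x reduction above
    print(g4 // 2**20, "MB vs", djvu // 2**20, "MB")   # 14 MB vs 4 MB

Over a fast link that hardly matters; for someone pulling down dozens
of manuals over a slow one, it adds up quickly.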

Of course you can use whatever format you want for your archiving.
Making it available in a more accessible format means that more people
are likely to take advantage of it.

For most documents, it is the information that I care about preserving,
not the pixels. I would be overjoyed if Adobe would buy out lizardtech
and adopt some of their technology, even the lossy bits.