Lossy compression vs. archiving and OCR (was Re: Many things)

From: Eric Smith <eric_at_brouhaha.com>
Date: Mon Jan 31 14:18:06 2005

Jim Battle wrote about DjVu:
> So for OCR purposes, I don't think this type of compression really hurts
> -- it replaces one plausible "e" image with another one.

No, that's exactly the kind of BS you DO NOT WANT for a file that you
plan to OCR. What if you've got a mathematical formula that has some
Latin "e" letters and some Greek epsilons in it? Or perhaps normal
and italic "e" letters? DjVu may well think they are "close enough",
while a good OCR program might be able to tell them apart accurately.
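
To make the failure mode concrete, here is a toy sketch in Python. It
assumes a symbol-matching coder that replaces any glyph with an earlier
one lying within some pixel-distance threshold. This is NOT DjVu's
actual algorithm; the 5x5 bitmaps, the threshold values, and the
function names are all made up for illustration.

```python
# Hypothetical 5x5 glyph bitmaps ("1" = ink). Real scanned glyphs are far larger.
LATIN_E = (
    "01110",
    "10001",
    "11111",
    "10000",
    "01110",
)
EPSILON = (
    "01110",
    "10000",
    "11110",
    "10000",
    "01110",
)


def hamming(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs, threshold):
    """Build a glyph dictionary, reusing the first entry within `threshold` pixels.

    threshold == 0 only merges identical bitmaps, so it is lossless.
    """
    dictionary = []
    indices = []
    for glyph in glyphs:
        for i, entry in enumerate(dictionary):
            if hamming(glyph, entry) <= threshold:
                indices.append(i)  # "close enough": reuse the stored glyph
                break
        else:
            dictionary.append(glyph)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decode(dictionary, indices):
    """Rebuild the glyph sequence from the dictionary and index list."""
    return [dictionary[i] for i in indices]


page = [LATIN_E, EPSILON, LATIN_E]

lossy = decode(*encode(page, threshold=3))     # the epsilon is replaced by an "e"
lossless = decode(*encode(page, threshold=0))  # every glyph survives intact
```

The two toy glyphs differ in only two pixels, so any threshold of two
or more stores a single "e" and the epsilon is gone from the decoded
page. No OCR program, present or future, can get it back from that file.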

The point of wanting lossless compression is that even if a good
OCR program today can't tell them apart accurately, a good OCR program
ten years from now might.

But if you use lossy compression now, you are likely discarding
information that the OCR program will need.

