Lossy compression vs. archiving and OCR (was Re: Many things)

From: Eric Smith <eric_at_brouhaha.com>
Date: Mon Jan 31 14:18:06 2005

Jim Battle wrote about DjVu:
> So for OCR purposes, I don't think this type of compression really hurts
> -- it replaces one plausible "e" image with another one.

No, that's exactly the kind of BS you DO NOT WANT for a file that you
plan to OCR. What if you've got a mathematical formula that has some
Latin "e" letters and some Greek epsilons in it? Or perhaps normal
and italic "e" letters? DjVu may well think they are "close enough",
while a good OCR program might be able to tell them apart accurately.
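
To make the failure mode concrete, here is a minimal sketch of a
"close enough" symbol matcher. It is a toy, not DjVu's actual encoder:
the 5x5 bitmaps, the pixel-difference metric, and the names hamming,
substitute_symbols, LATIN_E and EPSILON are all made up for illustration.

```python
def hamming(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def substitute_symbols(glyphs, threshold):
    """Toy lossy coder (hypothetical, not DjVu's algorithm).

    Each glyph is replaced by the first dictionary entry that differs
    from it in at most `threshold` pixels; otherwise it is added to the
    dictionary as a new symbol. Returns (dictionary, indices).
    """
    dictionary = []
    indices = []
    for glyph in glyphs:
        for i, known in enumerate(dictionary):
            if hamming(glyph, known) <= threshold:
                indices.append(i)  # reuse an existing, merely similar shape
                break
        else:
            dictionary.append(glyph)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


# Hypothetical 5x5 bitmaps: a Latin "e" and a Greek epsilon.
LATIN_E = (
    ".###.",
    "#...#",
    "#####",
    "#....",
    ".###.",
)
EPSILON = (
    ".###.",
    "#....",
    "###..",
    "#....",
    ".###.",
)

page = [LATIN_E, EPSILON, LATIN_E]

# A tolerance of 3 pixels merges the epsilon into the "e".
lossy_dict, lossy_idx = substitute_symbols(page, threshold=3)

# A tolerance of 0 (exact matches only) keeps the two shapes distinct.
exact_dict, exact_idx = substitute_symbols(page, threshold=0)
```

With the tolerant setting the decoded page contains three identical "e"
shapes, and no OCR program, however good, can get the epsilon back.
With exact matching the dictionary keeps both shapes.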

The point of wanting lossless compression is that even if a good
OCR program today can't tell them apart accurately, a good OCR program
ten years from now might.

But if you use lossy compression now, you are likely discarding
information that the OCR program will need.

Received on Mon Jan 31 2005 - 14:18:06 GMT
