paper -> HTML

From: John Foust <jfoust_at_threedee.com>
Date: Wed Dec 30 09:56:12 1998

At 01:15 AM 12/30/98 -0800, The Sam Ismail wrote:
>
>The OCR is OK when the text is just normal, and does remarkably well. But
>I need an OCR suite smarter than Xerox's TextBridge Classic. I also need
>some good post-processing software, or at least need to know how to scan a
>simple black & white document without the scanner introducing blotches and
>crap. Any suggestions?

I've used Caere OmniPage in the past, and it seemed pretty good, but
I wasn't trying to scan old computer docs, just nice typewriter pages.

I'm very interested in the collective wisdom about this, so of course
it seems quite on-topic to me. I'd like to scan the ASR-33 Teletype
manuals, which contain plenty of odd hand-set type, drawings, off-size
pages, schematics, etc. I'd also like to restore the UCSD Pascal
manuals; I've heard that the only electronic copies at UCSD were
lost a long time ago.

Given these problems of line art and odd character sets, I suspect
the most useful first step would be to scan all docs at a given
resolution, then store them as bitmaps in a format most easily
loaded into any present or future OCR / PDF-ish program. Someone
mentioned the multi-page TIFF format. As for which resolution,
I think 300 DPI might be too coarse.
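
To put rough numbers on the resolution question, here is a minimal
sketch of the arithmetic. It assumes a US Letter page (8.5 x 11 inches)
scanned as an uncompressed 1-bit black & white bitmap; the page size,
the bit depth, and the function name bitmap_size are my assumptions for
illustration, not anything from a particular scanner or OCR package.

```python
import math

def bitmap_size(width_in, height_in, dpi, bits_per_pixel=1):
    """Return (width_px, height_px, size_bytes) for an uncompressed scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Assumption: each row is padded to a whole number of bytes.
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return width_px, height_px, row_bytes * height_px

# Compare a few candidate resolutions for an assumed 8.5 x 11 inch page.
for dpi in (300, 400, 600):
    w, h, size = bitmap_size(8.5, 11, dpi)
    print(f"{dpi} DPI: {w} x {h} px, {size / 1024:.0f} KiB uncompressed")
```

Doubling the resolution quadruples the pixel count, so a 600 DPI page
takes about four times the space of a 300 DPI page before compression.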

I like Doug's idea of shooting for HTML. I recall the multi-res
buttons on IBM's patent server, which let you browse thumbnails
easily, then zoom in on the desired page at various resolutions.
Is there an off-the-shelf tool for doing this?
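
Lacking such a tool, a thumbnail-plus-zoom index is simple enough to
generate by hand. The sketch below is only an illustration of the idea:
the function name page_index and the file naming scheme
(page001_thumb.png, page001_150.png, and so on) are hypothetical, and
it assumes the page images at each resolution already exist.

```python
from html import escape

def page_index(pages, resolutions=(75, 150, 300)):
    """Build an HTML index: one thumbnail per page plus a link per resolution.

    The <page>_thumb.png and <page>_<dpi>.png file names are a
    hypothetical naming scheme chosen for this example.
    """
    rows = []
    for page in pages:
        name = escape(page, quote=True)
        # One link for each available resolution of this page.
        links = " ".join(
            f'<a href="{name}_{dpi}.png">{dpi} DPI</a>' for dpi in resolutions
        )
        rows.append(f'<p><img src="{name}_thumb.png" alt="{name}"> {links}</p>')
    return "<html><body>\n" + "\n".join(rows) + "\n</body></html>\n"

print(page_index(["page001", "page002"]))
```

Writing the returned string to index.html next to the images gives a
browsable page of thumbnails with a zoom link for each resolution.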

- John
Received on Wed Dec 30 1998 - 09:56:12 GMT
