paper -> HTML (and The First PC)

From: Uncle Roger <sinasohn_at_ricochet.net>
Date: Wed Jan 6 20:52:21 1999

At 01:49 AM 12/30/98 -0600, you wrote:
> 1) do a color scan to grab images
> 2) clean up images
> 3) resize based on guess at a good size and res for web pages

Don't think you can do much about these steps. I usually shoot for 320x240
pixels for web images -- on most monitors that's about 4" by 3" or so. It
used to be (not sure if this is still true) that the default width for
Netscape on the Mac gave you 400 pixels across; I believe on the PC it was
480? (I'll have to dig up that article again.) Also, that's a manageable
size for downloads.

Anyway, on a 640x480 screen, you lose some width for scroll bars and all;
plus you need a border/margin... Sure, you can design your web pages for
800x600, if you don't care that most people won't be able to see it all at
once.
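
For what it's worth, the sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration, not something from the original scans: the function names are made up, and the 80 dpi figure is an assumption worked backwards from "320x240 is about 4 by 3 inches"; real monitors vary.

```python
def fit_within(width, height, max_width=320, max_height=240):
    """Scale (width, height) down to fit inside the target box.

    Keeps the aspect ratio and never enlarges an image that is
    already small enough.
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def size_in_inches(pixels, dpi=80):
    """Approximate on-screen size, assuming a monitor of roughly `dpi`."""
    return pixels / dpi


if __name__ == "__main__":
    w, h = fit_within(1600, 1200)  # e.g. a large color scan
    print(w, h, size_in_inches(w), size_in_inches(h))
```

The 320x240 default is just the target mentioned above; pass a different box if you are designing for a wider window.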

> 4) scan again as B/W line art
> 5) OCR
> 6) clean up OCR
> 7) create HTML combining OCR'd text and images
>
>I don't much like PDF for web docs, so an HTML solution would be best. It
>looks like the "pro" version of Xerox's OCR software might automate the
>task somewhat. Any recommendations?

Well, your main issue is getting the text into machine-readable format. My
current belief is that the best way (especially for lower quality originals)
is to read them aloud into a word processor using DragonDictate or similar.
Once you've got a text file, there are several options for getting it
HTML'ized, including things like MS Word and HTML editors. (I prefer doing
it manually.) Depending on what you're doing, a CGI program that
reads/formats text files, inserting images as necessary, might be the way
to go. (See http://www.sinasohn.com/clascomp/index2.htm for an example.)
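
To give a rough idea of what such a text-to-HTML formatter might look like, here is a minimal Python sketch. It is not the program behind the URL above; the `[[image: ...]]` marker syntax, the function name, and the page layout are all assumptions made up for illustration. Blank lines separate paragraphs, and a paragraph consisting only of a marker becomes an image tag.

```python
import html
import re

# Hypothetical marker: a paragraph of the form [[image: filename]]
IMAGE_MARKER = re.compile(r"^\[\[image:\s*([^\s\]]+)\s*\]\]$")


def text_to_html(text, title="Untitled"):
    """Turn plain text into a simple HTML page.

    Paragraphs are separated by blank lines; image markers are
    replaced with <img> tags; everything else is escaped.
    """
    parts = []
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        match = IMAGE_MARKER.match(block)
        if match:
            src = html.escape(match.group(1), quote=True)
            parts.append(f'<p><img src="{src}" alt=""></p>')
        else:
            # Collapse hard line wraps inside a paragraph, then escape.
            body = html.escape(" ".join(block.split()))
            parts.append(f"<p>{body}</p>")
    return (
        "<html><head><title>" + html.escape(title) + "</title></head>\n"
        "<body>\n" + "\n".join(parts) + "\n</body></html>\n"
    )


if __name__ == "__main__":
    sample = "Hello & welcome.\n\n[[image: fig1.gif]]\n\nSecond\nparagraph."
    print(text_to_html(sample, title="Sample"))
```

A real CGI version would also need to print a Content-Type header and decide safely which text file to read; that part is left out here.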



--------------------------------------------------------------------- O-

Uncle Roger "There is pleasure pure in being mad
roger_at_sinasohn.com that none but madmen know."
Roger Louis Sinasohn & Associates
San Francisco, California http://www.sinasohn.com/
Received on Wed Jan 06 1999 - 20:52:21 GMT
