Many things from Antonio Carlini on 2005-01-31 (2005-January)

From: Antonio Carlini <a.carlini_at_ntlworld.com>
Date: Mon Jan 31 17:47:32 2005

> I maintain that PDF should not be used merely as a container
> for existing graphics files because there is normally
> no easy free way to extract the image data and use it in
> another program.

I think ImageMagick will do this sort of thing quite happily.
As for using PDF as a container for scanned images: that's
actually one way I think it *should* be used.

Perfect OCR would be better than G4 TIFF living in a PDF
wrapper. When such OCR exists, I'll happily run everything
I have through it - then we'll have tiny PDFs and we'll
easily be able to turn them into text or html on the fly
for those who want them that way.

The problem today is that perfect (or even near perfect) OCR
is a long way off, despite the strides that seem to have been
taken lately. A typical document I scan will have 200-300
pages. Just how many errors would you be willing to tolerate
in such a document? It's even worse when you realise that
there are plenty of technical phrases and the occasional
semi-mathematical expression.
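
To put rough numbers on that question, here is a back-of-envelope
sketch. The figure of 2000 characters per page and the accuracy
levels are assumptions picked purely for illustration, not
measurements of any real OCR package.

```python
def expected_errors(pages, chars_per_page, accuracy):
    """Expected number of misrecognised characters in a document,
    treating every character as an independent trial."""
    return pages * chars_per_page * (1.0 - accuracy)


# Assumed values: a 300-page manual, about 2000 characters per page.
PAGES = 300
CHARS_PER_PAGE = 2000

for accuracy in (0.99, 0.999, 0.9999):
    errors = expected_errors(PAGES, CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.2%} accurate: about {errors:.0f} errors, "
          f"{errors / PAGES:.1f} per page")
```

Under those assumptions even a per-character accuracy of 99.9%
leaves several hundred errors spread through one manual.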

> majority of users who do this screw it up massively (I'm
> thinking 150 DPI JPGs of scanned text).

Then the scan was born screwed. Shoving it into or dragging it
out of a convenient viewing wrapper makes no difference to
its essential nature. If this is the only scan, then it is
better than nothing. Otherwise, use someone else's scan. In
both cases a gentle email pointing out the problem might help
prevent future "issues".

> Where are the tools to create DjVu-like PDF files? The best
> Acrobat can do is OCR text but still leave the source bitmap
> in place... If I scan in a page with a background color image
> with B&W text foreground, where are the PDF tools to properly
> handle layer separation? (Not CMYK separation, you know what
> I mean :-)

I've had real trouble with RSTS and RT-11 documents which
contained mostly B&W text but with some examples of output
in colour, sections that apply to only one or other format
on a grey- or pink-shaded background, and foreground text
sometimes in dark red or blue. The original scans are huge
(colour, possibly 24-bit, TIFF). I've not found any sensible
way of automatically post-processing them to produce something
reasonable. It *should* be possible to split the B&W text
out and into its own G4 encoded layer, do the same for
the various colours of text (each encoded as bi-level
G4 and then the layer marked as "display this in colour X")
and blocks of shading in further layers for the background
where applicable. If DjVu can do this, I'm all ears!
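
As a rough illustration of the splitting step, here is a minimal
sketch in plain Python. It assigns each pixel to the nearest of a
few ink colours, or to the paper, and produces one bi-level mask
per ink. The palette names and RGB values, and the function names,
are all hypothetical; a real scan would need its palette measured
from the page, and the shaded background blocks would need a layer
of their own, which this sketch simply treats as paper.

```python
# Hypothetical ink palette: the names and RGB values are assumptions.
INKS = {
    "black": (0, 0, 0),
    "dark_red": (139, 0, 0),
    "blue": (0, 0, 255),
}
PAPER = (255, 255, 255)


def dist2(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split_layers(pixels, inks=INKS, paper=PAPER):
    """Split an image (a list of rows of RGB tuples) into one
    bi-level mask per ink colour. A mask holds 1 where the pixel
    is closer to that ink than to the paper or any other ink."""
    masks = {name: [[0] * len(row) for row in pixels] for name in inks}
    for y, row in enumerate(pixels):
        for x, pixel in enumerate(row):
            best_name = None               # None means "paper"
            best_dist = dist2(pixel, paper)
            for name, rgb in inks.items():
                d = dist2(pixel, rgb)
                if d < best_dist:
                    best_name, best_dist = name, d
            if best_name is not None:
                masks[best_name][y][x] = 1
    return masks


# Tiny 2x3 test image: near-black, paper, dark red / blue, pink, black.
image = [
    [(10, 10, 10), (250, 250, 250), (130, 5, 5)],
    [(5, 5, 240), (255, 192, 203), (0, 0, 0)],
]
for name, mask in split_layers(image).items():
    print(name, mask)
```

Each mask is bi-level, so it is the sort of thing that could be
handed to a G4 encoder and then tagged with its display colour.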

Antonio

-- 
---------------
Antonio Carlini arcarlini_at_iee.org

Received on Mon Jan 31 2005 - 17:47:32 GMT
