PDF, DjVu, scanned pages, text and line art vs. continuous-tone images (was Re: Many things)

From: Eric Smith <eric_at_brouhaha.com>
Date: Mon Jan 31 14:13:27 2005

I wrote:
> By default, DjVu uses lossy compression, which is
> significantly smaller.

Jim wrote:
> To be fair, DjVu lossy-encodes the graphics, not the text. One of the
> "selling points" of the format is that B&W text is kept on its own
> lossless layer.

If so, that's apparently a recent change. When I experimented with the
demo of the commercial product about 18 months ago, it was definitely
using lossy coding on the entire image. When I used it to encode a
bitmap of a page of text (from a TIFF encoded with lossless G4 coding)
into a DjVu file and extracted it again, it was not able to reconstruct
pixel-identical output. I xor'd the original against the extracted
image, and there were differences along all of the character edges.
I believe that's why it was able to store the file in only about 70%
of the space occupied by the lossless G4 version.
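For what it's worth, here is roughly how that comparison can be
reproduced. This is just a sketch using the Pillow Python imaging
library, and the filenames are placeholders for the original G4 TIFF
and the page image extracted back out of the DjVu file:

    from PIL import Image, ImageChops

    # Load both pages as 1-bit (bilevel) images.
    orig = Image.open("original-g4.tif").convert("1")
    back = Image.open("djvu-roundtrip.tif").convert("1")

    # XOR: the result is white wherever the two pages disagree.
    diff = ImageChops.logical_xor(orig, back)
    changed = sum(1 for px in diff.getdata() if px)
    print("%d of %d pixels differ"
          % (changed, orig.size[0] * orig.size[1]))
    diff.save("diff.tif")

A lossless round trip would leave the difference image completely
black; the character-edge differences described above show up as thin
white outlines instead.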

I'm not disputing that the DjVu format is *capable* of doing what you
say, but in my experience that's not what the available software actually
did.

Jim wrote in another message:
> DjVu has other advantages, such as local/window/viewport decoding of
> images with ludicrously high dimensions/resolutions but I understand
> your point.

I'm not sure I fully understand, but it doesn't sound like anything
that the PDF format can't support. I would rather invest effort in
improving the capabilities of free PDF viewer software such as xpdf
than push a different standard.

> Where are the tools to create DjVu-like PDF files?

I've been doing it myself with an experimental version of my "tumble"
program. With it, the process is entirely manual: I have to split the
continuous-tone images out into a separate layer or file with an image
editor (such as Gimp). Then I use tumble to compose a page
with the background as G4 and the images as JPEG.
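To give a concrete idea of what such a composed page looks like, here
is a rough sketch (not tumble itself) using the pikepdf Python
library. It assumes you already have the raw CCITT G4 data for the
bilevel layer and a baseline JPEG for the continuous-tone layer; the
filenames, image dimensions, and placement numbers are all
placeholders:

    import pikepdf

    W, H = 612, 792              # page size in points (8.5 x 11 in)
    COLS, ROWS = 2550, 3300      # pixel size of a 300 dpi bilevel scan

    pdf = pikepdf.Pdf.new()
    page = pdf.add_blank_page(page_size=(W, H))

    # Bilevel text/line-art layer, kept in CCITT G4 as-is.
    bg = pikepdf.Stream(pdf, open("page.g4", "rb").read())
    bg.Type = pikepdf.Name.XObject
    bg.Subtype = pikepdf.Name.Image
    bg.Width, bg.Height = COLS, ROWS
    bg.ColorSpace = pikepdf.Name.DeviceGray
    bg.BitsPerComponent = 1
    bg.Filter = pikepdf.Name.CCITTFaxDecode
    bg.DecodeParms = pikepdf.Dictionary(K=-1, Columns=COLS, Rows=ROWS)

    # Continuous-tone layer, passed through as JPEG (DCTDecode).
    fg = pikepdf.Stream(pdf, open("photo.jpg", "rb").read())
    fg.Type = pikepdf.Name.XObject
    fg.Subtype = pikepdf.Name.Image
    fg.Width, fg.Height = 1200, 900        # placeholder JPEG size
    fg.ColorSpace = pikepdf.Name.DeviceRGB
    fg.BitsPerComponent = 8
    fg.Filter = pikepdf.Name.DCTDecode

    page.obj.Resources = pikepdf.Dictionary(
        XObject=pikepdf.Dictionary(Bg=bg, Fg=fg))

    # Paint the G4 layer over the whole page, then the JPEG where
    # the photo belongs (a 4 x 3 inch area in this example).
    page.obj.Contents = pdf.make_stream(
        ("q %d 0 0 %d 0 0 cm /Bg Do Q "
         "q 288 0 0 216 162 450 cm /Fg Do Q" % (W, H)).encode())

    pdf.save("layered.pdf")

The point is simply that a PDF page can carry a G4-compressed bilevel
image and a DCT-compressed image side by side, which is essentially
the same foreground/background separation DjVu is built around.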

I'm also experimenting with using this for pages that are only text
and line art, but that have a few colors: DEC manuals that had user
input printed in blue, for instance. I separate the blue text into
another layer and G4-encode it separately from the black text. PDF's
image-mask (stencil mask) feature can then be used to paint a blue
rectangle the size of the whole page, clipped to the blue G4 image.
Acrobat Reader handles this correctly, but some other PDF processing
programs do not.
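In the same hedged style as the sketch above (and continuing from it,
so pdf, page, COLS and ROWS are the objects defined there, with a page
that has only the two text layers and no photos), the blue layer looks
roughly like this; the filename is again a placeholder:

    import pikepdf

    def add_blue_layer(pdf, page, cols, rows):
        # Second G4 layer drawn as a PDF stencil mask: the image is
        # declared with /ImageMask true, and its black (0) samples
        # are painted in whatever fill color is current when it is
        # drawn.
        blue = pikepdf.Stream(pdf, open("blue-layer.g4", "rb").read())
        blue.Type = pikepdf.Name.XObject
        blue.Subtype = pikepdf.Name.Image
        blue.Width, blue.Height = cols, rows
        blue.BitsPerComponent = 1
        blue.ImageMask = True
        blue.Filter = pikepdf.Name.CCITTFaxDecode
        blue.DecodeParms = pikepdf.Dictionary(K=-1, Columns=cols,
                                              Rows=rows)

        page.obj.Resources.XObject.Blue = blue
        # Draw the black text layer, then set the fill color to blue
        # and paint the stencil mask over the whole page.
        page.obj.Contents = pdf.make_stream(
            b"q 612 0 0 792 0 0 cm /Bg Do Q "
            b"q 0 0 1 rg 612 0 0 792 0 0 cm /Blue Do Q")

The "0 0 1 rg" sets the fill color to blue, and the mask's black
pixels are then painted in that color, which gives exactly the
full-page blue rectangle clipped to the G4 image described above.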

I hope to automate the multi-color text separation in tumble using
code derived from Tim Shoppa's "timify.c". Automating the detection
and processing of continuous-tone images is in my plans as well, but
further out. There don't seem to be any good published algorithms for
detecting continuous-tone regions, so I'll have to experiment. As a
first step I plan to do 2D FFTs on areas of the page; text and line
art should have their energy predominantly at DC and at high
frequencies, while continuous-tone images should have a more even
frequency distribution.
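A first cut at that frequency test might look like the following
numpy sketch; the band limits and threshold are guesses that would
need tuning against real pages:

    import numpy as np

    def looks_continuous_tone(tile, band=(0.05, 0.35), threshold=0.5):
        # tile is a small greyscale region of the page as a 2-D array.
        # Text and line art should concentrate energy at DC and at
        # high frequencies; a photo spreads it more evenly, so a
        # larger share of its non-DC energy lands in the middle band.
        spectrum = np.abs(np.fft.fftshift(np.fft.fft2(tile))) ** 2
        h, w = tile.shape
        yy, xx = np.ogrid[:h, :w]
        # Normalized distance from DC: 0 at center, ~1 at the corners.
        r = np.hypot((yy - h // 2) / (h / 2),
                     (xx - w // 2) / (w / 2)) / np.sqrt(2)
        total = spectrum[r > band[0]].sum()    # all non-DC energy
        mid = spectrum[(r > band[0]) & (r < band[1])].sum()
        return total > 0 and mid / total > threshold

Running this over, say, 64x64-pixel tiles of the greyscale scan and
merging adjacent photo-classified tiles into rectangles would give a
crude map of where the continuous-tone regions are.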

So far I'm doing this work by scanning a page twice, once in bilevel and
once in greyscale or color. I do that because the published algorithms
for converting greyscale text and line art to bilevel (thresholding) are
nowhere near as good as what's done in a good scanner. Picture Elements
makes a PCI card that can do this (and it even works with Linux), but
it's very expensive, so I really want a software-only solution.
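One example of the sort of published algorithm in question is Otsu's
global threshold, sketched below with numpy and Pillow (the filenames
are placeholders). It picks a single threshold for the whole page,
which is one obvious way it falls short of what a good scanner does:

    import numpy as np
    from PIL import Image

    def otsu_threshold(gray):
        # Classic Otsu: pick the threshold that maximizes the
        # between-class variance of the greyscale histogram.
        hist = np.bincount(gray.ravel(), minlength=256).astype(float)
        prob = hist / hist.sum()
        omega = np.cumsum(prob)                  # class probability
        mu = np.cumsum(prob * np.arange(256))    # cumulative mean
        with np.errstate(divide="ignore", invalid="ignore"):
            sigma_b = (mu[-1] * omega - mu) ** 2 / (omega * (1 - omega))
        return int(np.nanargmax(sigma_b))

    gray = np.asarray(Image.open("page-grey.tif").convert("L"))
    t = otsu_threshold(gray)
    bilevel = Image.fromarray(((gray > t) * 255).astype(np.uint8))
    bilevel.convert("1").save("page-otsu.tif")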

But the experimental version of tumble is still very buggy and not
yet publicly released, so it won't help you any.

Acrobat can do the manual approach as well, if you use a separate editor
to split the page into layers, and import each layer separately.

It appears that pdfzone lists some expen$ive programs that do such
things in a more automated fashion. I'm also told that some OCR
packages can do this, but I haven't verified it.

Eric