Many things from Jim Battle on 2005-01-31 (2005-January)

From: Jim Battle <frustum_at_pacbell.net>
Date: Mon Jan 31 08:14:59 2005

Jim Leonard wrote:
> Eric Smith wrote:
>
>> By default, DjVu uses lossy compression, which is
>> significantly smaller. But for archival purposes, I *much* prefer the
>> lossless coding. Should I decide to OCR the documents at a later date,

I have a comment on that below.

>
> To be fair, DjVu lossy-encodes the graphics, not the text. One of the
> "selling points" of the format is that B&W text is kept on its own
> lossless layer.

On my Wang site, I've put up all of my documents both as DjVu (because
the files are much smaller -- often 1/3 the size) and as PDF (because
almost nobody has a DjVu viewer).

DjVu doesn't really know what is text and what is graphics. It has a
few different compression algorithms, applies each to regions of an
image, and uses whichever is best. I can show you a table of contents
where part of it is bilevel and part is grayscale. This happened because
the person who sent it to me scanned the document as low-resolution
grayscale.
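
The "try each algorithm on a region and keep the best" idea can be
sketched roughly as below. This is only an illustration: the three
standard-library codecs are stand-ins I picked for the example, not the
coders DjVu really uses, and the names CODECS, pick_best and encode_page
are hypothetical.

```python
import bz2
import lzma
import zlib

# Stand-in codecs (an assumption for illustration only; DjVu has its own
# bilevel and continuous-tone coders, which are not modelled here).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def pick_best(region: bytes):
    """Compress one region with every codec and keep the smallest result."""
    results = {name: comp(region) for name, (comp, _) in CODECS.items()}
    best = min(results, key=lambda name: len(results[name]))
    return best, results[best]


def encode_page(regions):
    """Encode each region independently, so codecs can differ per region."""
    return [pick_best(region) for region in regions]


if __name__ == "__main__":
    # Two hypothetical regions: a flat one and a more varied one.
    page = [b"\x00" * 4096, bytes(range(256)) * 16]
    for name, payload in encode_page(page):
        print(name, len(payload))
```

Because each region is judged on its own, one page can end up with
different coders in different places, which is how a single table of
contents can come out part bilevel and part grayscale.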

Although it doesn't really know what text is per se, one of its
algorithms finds glyph-like things. Once it has all glyph-like things
isolated on a page, it compares them all to each other and if two glyphs
are similar enough, it will just represent them both (or N of them) with
one compressed glyph image.
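
A minimal sketch of that glyph-sharing idea follows. It assumes glyphs
are already isolated as small bilevel bitmaps (lists of "0"/"1" strings),
and it uses a plain pixel-difference count with a made-up threshold as
the similarity test. The real DjVu matcher is more sophisticated; the
names hamming, build_dictionary and max_diff are hypothetical.

```python
def hamming(a, b):
    """Count differing pixels between two same-sized bilevel bitmaps."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def build_dictionary(glyphs, max_diff=2):
    """Greedy matching: each glyph reuses the first similar-enough prototype.

    Returns (prototypes, refs), where refs[i] is the index of the
    prototype that stands in for glyphs[i]. max_diff is an assumed
    threshold chosen only for this example.
    """
    prototypes = []  # distinct glyph bitmaps that actually get stored
    refs = []        # one prototype index per input glyph
    for glyph in glyphs:
        for i, proto in enumerate(prototypes):
            same_shape = (len(proto) == len(glyph)
                          and len(proto[0]) == len(glyph[0]))
            if same_shape and hamming(proto, glyph) <= max_diff:
                refs.append(i)
                break
        else:
            # Nothing close enough: this glyph becomes a new prototype.
            prototypes.append(glyph)
            refs.append(len(prototypes) - 1)
    return prototypes, refs


if __name__ == "__main__":
    e1 = ["0110", "1001", "1111", "1000", "0111"]
    e2 = ["0110", "1001", "1111", "1000", "0110"]  # one pixel off from e1
    o = ["0110", "1001", "1001", "1001", "0110"]
    prototypes, refs = build_dictionary([e1, o, e2])
    print(len(prototypes), refs)
```

Here the second "e" is stored only as a reference to the first one, so
the page carries two glyph images instead of three. Whatever pixel
differences e2 had are lost, which is the lossy part.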

So for OCR purposes, I don't think this type of compression really hurts
-- it replaces one plausible "e" image with another one.

At work the high speed copier/printer/scanner can process 600 dpi
bilevel images at an amazing rate. I can just ftp my images from the
scanner after it cranks through a batch of pages. Unfortunately, the
copier is very aggressive about dithering and as a result, pure text
pages are a lot more "furry" -- I don't know how to describe it -- the
edges and interiors of text have a lot more pixel noise. As a result,
the PDF/DjVu size ratio is a lot smaller than it is for pages I've
scanned with my desktop scanner.