Jim Leonard wrote:
> Eric Smith wrote:
> 
>> By default, DjVu uses lossy compression, which is
>> significantly smaller.  But for archival purposes, I *much* prefer the
>> lossless coding.  Should I decide to OCR the documents at a later date,
I have a comment on that below.
> 
> To be fair, DjVu lossy-encodes the graphics, not the text.  One of the 
> "selling points" of the format is that B&W text is kept on its own 
> lossless layer.
On my Wang site, I've put up all of my documents both as DjVu (because 
they are much smaller -- often 1/3 the size) and as PDF, because almost 
nobody has DjVu.

DjVu doesn't really know what is text and what is graphics.  It has a 
few different compression algorithms, applies each to regions of an 
image, and uses whichever is best.  I can show you a table of contents 
where part of it is bilevel and part is grayscale.  This happened because 
the person who sent it to me scanned the document at low-resolution 
grayscale.
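
To make the "try each one and keep the best" idea concrete, here is a 
rough sketch.  It is only an illustration of the selection step: the 
codecs are general-purpose ones from the Python standard library standing 
in for DjVu's own coders, and the names best_codec and encode_regions are 
made up for this example.

```python
import bz2
import lzma
import zlib

# Stand-in codecs from the standard library; these are NOT DjVu's coders.
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def best_codec(region: bytes):
    """Compress one region with every codec and keep the smallest result."""
    results = {name: fn(region) for name, fn in CODECS.items()}
    name = min(results, key=lambda n: len(results[n]))
    return name, results[name]


def encode_regions(regions):
    """Choose a codec independently for each region of a page."""
    return [best_codec(r) for r in regions]
```

Because the choice is made per region, one page can end up with 
different coders for different parts, which is the sort of thing that 
would produce a mixed table of contents like the one above.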
Although it doesn't really know what text is per se, one of its 
algorithms finds glyph-like things.  Once it has all the glyph-like 
things on a page isolated, it compares them all to each other, and if 
two glyphs are similar enough, it will just represent them both (or N of 
them) with one compressed glyph image.

So for OCR purposes, I don't think this type of compression really hurts 
-- it replaces one plausible "e" image with another one.
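
The glyph matching described above can be sketched roughly like this.  
It is a minimal illustration, not DjVu's actual matcher: the pixel-count 
distance, the max_diff threshold, and the function names are all 
assumptions made for the example, and glyphs are simply tuples of 
"0"/"1" strings.

```python
def hamming(a, b):
    """Count differing pixels between two equal-sized bilevel bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def same_shape(a, b):
    """True if both bitmaps have the same number of rows and columns."""
    return len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))


def cluster_glyphs(glyphs, max_diff=2):
    """Greedy matching: reuse the first stored glyph within max_diff pixels.

    Returns (prototypes, refs), where refs[i] is the index of the
    prototype that stands in for glyphs[i].
    """
    prototypes = []
    refs = []
    for g in glyphs:
        for idx, p in enumerate(prototypes):
            if same_shape(g, p) and hamming(g, p) <= max_diff:
                refs.append(idx)  # close enough: share the stored image
                break
        else:
            prototypes.append(g)  # nothing similar yet: store a new one
            refs.append(len(prototypes) - 1)
    return prototypes, refs
```

Given two "e" bitmaps that differ by a single pixel and an "o" that 
differs by more than max_diff, only two images get stored and both "e" 
occurrences point at the same one.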
At work, the high-speed copier/printer/scanner can process 600 dpi 
bilevel images at an amazing rate.  I can just ftp my images from the 
scanner after it cranks through a batch of pages.  Unfortunately, the 
copier is very aggressive about dithering, and as a result pure text 
pages are a lot more "furry" -- I don't know how to describe it -- the 
edges and interiors of text have a lot more pixel noise.  As a result, 
the PDF/DjVu size ratio is a lot smaller than it is for pages where I've 
used my desktop scanner.
Received on Mon Jan 31 2005 - 08:14:59 GMT