Basics of Document Capture

Image Capture - Capture is the most basic building block of any imaging application. It is the process of converting hard-copy paper documents to an electronic format. Typically this is done with a scanner, which outputs TIFF Group 4, JPEG, or even PDF images. Recognition technology such as OCR (optical character recognition), ICR (intelligent character recognition), and IDR (intelligent document recognition) can be applied to automatically transform the static words or numbers appearing on images into data usable in other electronic applications.

Following are some elements of capture:

  • Scanners
  • Digital Copiers
  • Image Capture Software
  • Image processing/compression
  • Forms Processing
  • Color Management
  • PDF
  • Storage

Scanners - Scanners are used to convert paper documents to electronic images. There is a niche of scanners known as document scanners designed specifically for document imaging. They typically run faster and are slightly more expensive than flatbed scanners.

Digital copiers - Every day, more digital copiers are being scan-enabled. Scan-to-e-mail as a fax replacement was the initial killer app for scanning on copiers, but there is now a movement to turn them into distributed capture devices for ECM/workflow applications.

Image Capture Software - Applications that facilitate the conversion of paper documents to electronic images. Originally designed to facilitate batch processing, capture software is now evolving to accommodate smaller volumes in Web capture and front-office applications.

Image processing/compression - Image processing involves the clean-up, cropping, straightening, and compression of images. Bi-tonal TIFF Group 4 is still the most popular format for document image compression. However, the emergence of PDF and affordable color scanners has recently created some alternative imaging options. These include mixed raster content (MRC) formats, where text and graphics can be separated and each compressed with the method that suits it best. PDF and JPEG 2000, Part 6 both support MRC compression.

OCR/ICR - These processes convert the text on document images to text that can be recognized by a computer application. They use computer algorithms to read the printed text on a page and convert that text to ASCII, XML, or some other electronically usable format. Once converted, the words on the documents can be used to index the document images or to feed other applications.
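
As a rough illustration of how recognized text can be used to index document images, the sketch below builds a small inverted index from OCR output. The file names, the sample text, and the `tokenize`, `build_index`, and `search` helpers are all hypothetical; in practice the text would come from an OCR engine rather than a hard-coded dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(ocr_text_by_image):
    """Map each word to the set of image files whose OCR text contains it."""
    index = defaultdict(set)
    for image_name, text in ocr_text_by_image.items():
        for word in tokenize(text):
            index[word].add(image_name)
    return index


def search(index, query):
    """Return the images that contain every word in the query, sorted by name."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


# Hypothetical OCR output, keyed by image file name (assumed for illustration).
ocr_text_by_image = {
    "scan_0001.tif": "Invoice 4471 from Acme Supply, total due 120.00",
    "scan_0002.tif": "Purchase order 9913 for Acme Supply",
    "scan_0003.tif": "Subscription card: renew for 12 months",
}

index = build_index(ocr_text_by_image)
print(search(index, "acme supply"))   # both Acme documents
print(search(index, "invoice acme"))  # only the invoice
```

The index stores image names rather than the text itself, so a hit leads back to the original scanned image.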

IDR - Recently, a concept called intelligent document recognition (IDR) has been introduced and accepted into document imaging applications. IDR involves the automated capture of data from, and classification of, complex documents. It enables a user, for example, to automatically extract relevant data from invoice forms, which, although they contain similar information, can look vastly different.
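
To make the idea concrete, here is a deliberately simplified sketch of classifying a document and then pulling the same data out of two differently worded invoices. It is a toy under stated assumptions, not how any IDR product works: the keyword lists, regular expressions, sample texts, and the `classify` and `extract_invoice` helpers are all invented for illustration.

```python
import re

# Hypothetical keyword sets per document class (assumed, not from any product).
CLASS_KEYWORDS = {
    "invoice": {"invoice", "amount", "due", "remit"},
    "purchase_order": {"purchase", "order", "ship", "deliver"},
}

# Patterns that tolerate different wordings of the same field.
INVOICE_NUMBER = re.compile(
    r"(?:invoice|inv)\s*(?:no\.?|number|#)?\s*[:\-]?\s*(\d+)", re.IGNORECASE
)
TOTAL = re.compile(
    r"(?:total|amount\s+due)\s*[:\-]?\s*\$?\s*(\d+(?:\.\d{2})?)", re.IGNORECASE
)


def classify(text):
    """Pick the class whose keywords overlap most with the text, else 'unknown'."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {label: len(words & kws) for label, kws in CLASS_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def extract_invoice(text):
    """Return the invoice number and total found in the text (None if absent)."""
    number = INVOICE_NUMBER.search(text)
    total = TOTAL.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "total": total.group(1) if total else None,
    }


# Two invoices carrying similar information in different layouts and wording.
invoice_a = "ACME SUPPLY\nInvoice No: 4471\nAmount due: $120.00"
invoice_b = "Globex Corp.\nINV #88213\nTotal 57.50\nPlease remit within 30 days"

for text in (invoice_a, invoice_b):
    if classify(text) == "invoice":
        print(extract_invoice(text))
```

Fixed patterns like these break quickly on real-world variety, which is the gap IDR is meant to close.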

Forms Processing - Historically, this has been the most lucrative application of OCR/ICR technology. Forms processing automates data entry from forms such as HCFAs and UB92s (health insurance claim forms), subscription cards, order forms, and any number of other document types. It involves applying OCR/ICR technology to specific fields to extract data that is automatically entered into a database. It is designed to replace keying operations.
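
Applying recognition "to specific fields" can be pictured as zone-based extraction: each field is a rectangle on the form, and the recognized words that fall inside it become the field value. The sketch below assumes the OCR step has already produced words with pixel coordinates. The `Word` and `Zone` classes, the field names, and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word and its bounding box in pixels (assumed OCR output)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


@dataclass
class Zone:
    """A rectangular region of the form that holds one field."""
    field: str
    x: float
    y: float
    w: float
    h: float

    def contains(self, word):
        # A word belongs to the zone if its center point lies inside it.
        cx = word.x + word.w / 2
        cy = word.y + word.h / 2
        return self.x <= cx <= self.x + self.w and self.y <= cy <= self.y + self.h


def extract_fields(words, zones):
    """Build one database-ready record: field name -> text found in its zone."""
    record = {}
    for zone in zones:
        hits = [wd for wd in words if zone.contains(wd)]
        # Assumes single-line fields, so left-to-right order is enough.
        hits.sort(key=lambda wd: wd.x)
        record[zone.field] = " ".join(wd.text for wd in hits)
    return record


# Hypothetical form template and OCR result.
zones = [
    Zone("patient_name", 100, 40, 200, 30),
    Zone("birth_date", 100, 80, 200, 30),
]
words = [
    Word("Doe", 155, 52, 35, 12),
    Word("Jane", 110, 52, 40, 12),
    Word("1975-03-14", 112, 92, 80, 12),
    Word("Signature", 20, 300, 70, 12),  # outside every zone, so ignored
]

print(extract_fields(words, zones))
```

The resulting dictionary stands in for the row that would be written to a database in place of manual keying.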

Color management - Color scanners are now available for approximately the same price as black-and-white and grayscale ones. Advanced layering technology such as MRC (mixed raster content) helps create color files that are about the same size as black and white ones.

PDF - PDF has emerged as the format of choice for front-office document imaging.

PDF Compression, Optimization and Search

The value proposition for PDF compression, optimization, and search has four main components:

  1. Save storage space. Everyone's available storage fills up eventually. Compressing bi-tonal files can reduce the space they occupy by roughly 90%. Color files compress even more than bi-tonal files, so color compression can reduce the space they occupy by over 98%. Before spending on more storage, you could batch compress all your PDFs. It is similarly advisable to convert TIFF and JPEG files to PDF and compress them at the same time. That way, all of your files will be viewable on demand, fully web-optimized, and text-searchable. These compressed files will use dramatically less space, and the files can always be restored to their native format if necessary.
  2. Efficiently Web-host and email documents. If you are transmitting files over the Internet or an intranet, you'll use 90%-98% less bandwidth by transmitting compressed PDF files. There are cost savings inherent in using less bandwidth and eliminating overage charges. In addition, you will be able to email corporate files, monthly statements, and newsletters without exceeding email attachment size limits.
  3. Benefit from increased speed of transmission. PDFs, TIFFs and JPEGs can take a very long time to transmit and receive, even with a high-speed Internet connection. With compressed PDFs, a file that used to take 10 minutes to transmit would take 1 minute. That difference may be hard to measure in a direct dollar for dollar savings equation, but anyone who has ever waited 10 minutes to download a file knows the importance of speed. After all, time is money for the sender as well as the receiver.
  4. Use OCR to retrieve files efficiently. Retrieving files quickly and easily is essential to making business processes efficient. Applying OCR to scanned documents makes these files fully text-searchable and easy to retrieve. Many scanned corporate and litigation files would benefit from full-text indexing and searchability. Most of these files are currently just field coded with a very limited field set in order to reduce costs. This limits the usefulness of captured documents, since only these fields can be queried. By converting scanned documents from traditional formats such as TIFF to PDF, it becomes very easy to add a "hidden" OCR text layer and support full-text search across the entire corporate database.
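
The storage and transmission figures above come down to simple arithmetic, sketched below. The 75 MB file size and the 1 Mbps link speed are assumptions chosen only to reproduce the 10-minute example. The helper names are hypothetical, and protocol overhead is ignored.

```python
def compressed_size(original_mb, reduction_pct):
    """Size in MB after reducing the original by the given percentage."""
    return original_mb * (100 - reduction_pct) / 100


def transfer_seconds(size_mb, link_mbps):
    """Transfer time: megabytes -> megabits (x8), divided by megabits per second."""
    return size_mb * 8 / link_mbps


original_mb = 75  # assumed size of an uncompressed scanned file
link_mbps = 1     # assumed link speed in megabits per second

for pct in (0, 90, 98):
    size = compressed_size(original_mb, pct)
    minutes = transfer_seconds(size, link_mbps) / 60
    print(f"{pct}% reduction: {size:.1f} MB, {minutes:.1f} min to transmit")
```

Under these assumptions the uncompressed file takes 10 minutes to transmit and the 90%-reduced file takes 1 minute, which matches the example in item 3.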

Storage : Document imaging has been inextricably tied to storage since its inception. When document imaging was first being used, TIFF Group 4 images were considered storage hogs. Now that storage prices have been greatly reduced, reliability and manageability are the biggest concerns of imaging users.

Optical Storage : Optical storage was the original document imaging storage format. Optical disk is renowned for its stability and write-once characteristics.

Magneto-optical : The classic 5.25 in. imaging format, MO is nearing the end of its life cycle. Its latest generation maxed out its capacity at 9.1 GB per disk. Two formats, including UDO (ultra-density optical), are vying to succeed it.

CD/DVD : CD is a very popular format for distributing document images. CD/DVD jukeboxes are also available as a less expensive alternative to MO.

Magnetic Storage : Until recently, magnetic storage was considered too expensive for storing image files. However, in recent years, lower-cost magnetic systems have specifically targeted the imaging space. WORM (write once read many) capabilities, which made optical so popular in imaging applications, have now been built into some magnetic solutions.

Tape Storage : Although inexpensive, tape is traditionally considered too slow and unstable for storage of document images. However, higher-grade WORM tape solutions are targeted at the imaging market.

Software : Optical jukebox management software has long been a niche market within the document imaging industry. Now, that software is being used to manage images across a variety of storage devices from magnetic, to optical, to tape. This approach is designed to optimize storage costs and performance.

ILM : Information lifecycle management (ILM) is the combination of ECM technology with storage. It is designed to marry the right type of storage with the right type of information. Highly accessed information, for instance, is kept on high-performance magnetic drives. Less accessed archived information may be kept on a slower device, but one with WORM characteristics. Software manages the distribution of the stored information.
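
The placement rule described here can be sketched as a small policy function. This is only an illustration: the 30-day access counter, the threshold of 10 accesses, the tier names, and the sample documents are all assumptions, and real ILM software weighs many more factors.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StoredDocument:
    name: str
    accesses_last_30_days: int  # assumed usage metric


def choose_tier(doc, hot_threshold=10):
    """Frequently accessed documents go to fast magnetic disk, the rest to a WORM archive."""
    if doc.accesses_last_30_days >= hot_threshold:
        return "high_performance_magnetic"
    return "worm_archive"


def plan_placement(docs, hot_threshold=10):
    """Group document names by the storage tier the policy assigns them."""
    plan = defaultdict(list)
    for doc in docs:
        plan[choose_tier(doc, hot_threshold)].append(doc.name)
    return dict(plan)


# Hypothetical documents with made-up access counts.
docs = [
    StoredDocument("open_claim_1182.pdf", 42),
    StoredDocument("closed_claim_0031.pdf", 0),
    StoredDocument("invoice_2019_07.pdf", 3),
]

print(plan_placement(docs))
```

Rerunning the plan as access patterns change is what lets the software migrate information between tiers over time.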