OCR (Optical Character Recognition) software is the main bridge between printed documents and their electronic counterparts. Traditional document imaging methods work in a two-dimensional environment, using templates and algorithms to recognize objects and patterns.
Current OCR methods can not only recognize a spectrum of colors but also distinguish the foreground of a document from its background. They also work with low-resolution images from sources such as cell phone cameras, the internet and faxes. To do this, OCR methods often have to de-skew and de-speckle the images and apply 3D image correction.
Primarily, OCR software programs use two different methods for optical character recognition. The first is feature extraction and the second is matrix matching. With feature extraction, the OCR software program recognizes shapes using mathematical and statistical techniques for detecting edges, ridges and corners in a text font so that it can identify the letters, sentences and paragraphs.
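As a rough illustration of feature extraction, the following Python sketch computes a few simple shape features from a tiny binary glyph. The glyph data, the function names and the crude definitions of "edge" and "corner" are assumptions made for this example; production OCR engines use far richer features.

```python
# Hypothetical 5x5 binary glyph for the letter "L" (1 = ink, 0 = background).
GLYPH_L = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]


def pixel(img, r, c):
    """Return the pixel at (r, c); anything outside the image counts as background."""
    if 0 <= r < len(img) and 0 <= c < len(img[0]):
        return img[r][c]
    return 0


def extract_features(img):
    """Build a small feature dictionary from a binary glyph.

    edges   : ink pixels with at least one background pixel among their 4 neighbours
    corners : ink pixels with ink both vertically and horizontally adjacent
              (a crude stand-in for a real corner detector)
    """
    rows, cols = len(img), len(img[0])
    edges = corners = 0
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1:
                continue
            up, down = pixel(img, r - 1, c), pixel(img, r + 1, c)
            left, right = pixel(img, r, c - 1), pixel(img, r, c + 1)
            if 0 in (up, down, left, right):
                edges += 1
            if (up or down) and (left or right):
                corners += 1
    return {
        "ink": sum(sum(row) for row in img),
        "edges": edges,
        "corners": corners,
        "row_profile": [sum(row) for row in img],
        "col_profile": [sum(img[r][c] for r in range(rows)) for c in range(cols)],
    }


features = extract_features(GLYPH_L)
```

A classifier would then compare such feature values, rather than raw pixels, against the values expected for each letter.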
OCR software programs using feature extraction achieve the best results when the image is clean and straight, uses clearly distinguishable fonts such as Helvetica or Arial, has dark letters on a white background and has a resolution of at least 300 dpi. In practice, these conditions cannot always be met. To read words accurately in less than ideal circumstances, OCR techniques have turned to matrix matching.
Matrix matching falls in the category of artificial intelligence. For example, organizations such as law enforcement agencies include matrix matching in the software they use to recognize images within video feeds. The process combines feature extraction with similarity measurements.
Similarity measurement uses algorithms and statistical formulas to compare an image with others in the same image or elsewhere in the document. This helps to recognize images within a spectrum of colors, even in 3D environments. The technology allows OCR software to recognize crooked images, images with too much background interference and images that need alteration before they can be read and interpreted correctly. Matrix matching techniques are also better at recognizing images at a lower resolution.
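The idea of scoring a glyph against stored patterns can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular OCR product: the 3x3 templates, the function names and the pixel-agreement score are all assumptions chosen for brevity.

```python
# Hypothetical 3x3 binary templates (1 = ink, 0 = background).
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}


def similarity(a, b):
    """Fraction of pixels that agree between two equal-sized binary grids."""
    total = len(a) * len(a[0])
    same = sum(
        1
        for r in range(len(a))
        for c in range(len(a[0]))
        if a[r][c] == b[r][c]
    )
    return same / total


def match_glyph(glyph, templates=TEMPLATES):
    """Return the best-matching template name and its similarity score."""
    best = max(templates, key=lambda name: similarity(glyph, templates[name]))
    return best, similarity(glyph, templates[best])


# An "L" with one missing pixel, as might happen in a low-quality scan.
noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
name, score = match_glyph(noisy_l)  # name is "L", score is 8/9
```

Because the result is a score rather than an exact match, a damaged glyph can still be assigned to its closest pattern.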
Today, several OCR software packages include features that can de-speckle and de-skew the image. They can also change the orientation of the page. A special technique called 3D correction can straighten images that the camera captured at an angle.
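De-speckling can take many forms. One simple variant, sketched below in Python, removes ink pixels that have no ink among their eight neighbours. The sample image, the function name and the `min_neighbours` parameter are assumptions for this illustration and do not describe how any specific OCR package works.

```python
def despeckle(img, min_neighbours=1):
    """Return a copy of a binary image with isolated ink pixels removed.

    An ink pixel is kept only if at least `min_neighbours` of its
    8 surrounding pixels are also ink. Counts are taken from the
    original image, so removals do not cascade.
    """
    rows, cols = len(img), len(img[0])
    cleaned = [row[:] for row in img]
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1:
                continue
            neighbours = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        neighbours += img[rr][cc]
            if neighbours < min_neighbours:
                cleaned[r][c] = 0
    return cleaned


# A vertical stroke in column 1 plus a single speck at row 2, column 3.
NOISY = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

clean = despeckle(NOISY)  # the speck is removed, the stroke is kept
```

Raising `min_neighbours` removes larger specks but also risks eroding thin strokes, so the threshold is a trade-off.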
OCR has traditionally been linked with scanning software. The scanning process offers clues that make the OCR results more accurate. However, not all images are available as hard copy, and a scanner may not be readily available. Sometimes the text to be extracted is available only in a PDF file or some other graphic file downloaded from the Internet. While older PDF files did not allow you to copy text, most PDF files created today let you select text with the mouse cursor and copy it from the document to your clipboard.
Advanced PDF creation software, however, includes features to protect the text in the converted document with a password. If you want to extract text from such a protected PDF document, your OCR software program will ask you for the password.