An extensive literature survey and consultation with several researchers in the OCR field turned up no previous work similar to the one presented here. Therefore, alternative approaches to evaluating OCR difficulty were investigated. Although these works do not aim to find page quality metrics, they are closely related to the scope of this project.
Mindy Bokser presents a comprehensive view of the problems involved in recognizing letters from a page image [3]. According to her work, touching and broken (split) characters appear to be the most important source of OCR problems. Regarding OCR technology, she acknowledges that
The best products do a good job on clean documents, but they all degrade in performance -some more gracefully than others- as document quality (or scanner quality) degrades [3].
Similarly, Nartker et al. [8] identified broken and touching characters as the leading cause of OCR errors. Table 2.1 summarizes the estimated OCR problems obtained from a 240-page test. Page quality errors account for 83.9% of the total number of errors in the set, whereas errors caused by other factors account for the remaining 16.1%.
Jenkins and Kanai [5] studied the influence of lexical factors on OCR performance. They controlled image quality and typographical features by creating synthetic images and using them as input to OCR devices. Based on their results, they suggested that linguistic factors, in addition to image-related factors, affect OCR performance, since most current OCR products incorporate a system lexicon to resolve character recognition ambiguity. Along these lines, the number of stopwords was identified as a factor in OCR accuracy, since stopwords are more likely to be included in the system's lexicon than non-stopwords.
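To make the stopword factor concrete, the following minimal sketch computes the fraction of word tokens in a page's text that are stopwords. It is an illustration only and not the procedure used by Jenkins and Kanai: the stopword list, the tokenization rule, and the function name `stopword_fraction` are all assumptions introduced here.

```python
import re

# Hypothetical, deliberately small stopword list (an assumption for this
# sketch); an actual study would rely on a much fuller list.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
    "is", "it", "of", "on", "or", "that", "the", "to", "was", "with",
})


def stopword_fraction(text: str) -> float:
    """Return the fraction of word tokens in `text` that are stopwords."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in STOPWORDS)
    return hits / len(tokens)
```

Under the lexical hypothesis described above, a page with a higher stopword fraction would be expected to gain more from lexicon-based disambiguation, with image quality held constant.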
An important point is that a high percentage of the errors are due to relatively few causes, all ultimately related to image quality, whereas a large number of other factors account for a relatively small share of the errors. Therefore, methods that address image quality problems can have a direct impact on OCR performance from an end user's point of view: fixing page-quality-related errors can increase the accuracy rate considerably.