The 15 ``B->G'' misclassifications (Table 4.2)
were examined carefully. The page images did not present substantial
degradation, confirming the classifier's
results. Figure 4.2 shows excerpts from some of these
images, where their quality can be appreciated.
Figure 4.2: Excerpts from B->G misclassified images (magnified)
All but 4 of these 15 ``problematic'' pages contained tables. The OCR output for many of these tables was illegible and generally useless. Figures 4.3 and 4.4 show a clean table image and its associated OCR output. Tables pose special problems for OCR devices.
Figure 4.3: Clean table image
The classifier labeled all 11 of these table-containing pages ``Good''
because, from an image-defects point of view, it could not find enough
evidence to assign them a ``Bad'' label. The contents of these
pages, however, suggest that table-generated OCR errors are special
and, therefore, unrelated to image-generated OCR errors. After
evaluating all 11 ``B->G'' misclassified ``table
pages'', the following observations are in order:
It is important to make clear that none of these three observations
depends on page quality. Furthermore, after visually inspecting
the images, tables that presented poor image quality were in general
correctly flagged as ``Bad'' by the classifier, and their OCR output
was subject to the normal image-quality-related errors (in addition to
the special problems posed by tables, mentioned above). Similarly,
pages labeled ``Good'' by the classifier were visually inspected
and found to indeed have high image quality. Some of these pages were
considered ``Bad'' in the experiment because their low OCR accuracy
stems from the special characteristics of tables mentioned above and
not from image defects. Furthermore, many of the ``B->G''
pages would not have been misclassified if only a selected subset of
the OCR devices used in the experiment had been employed. Such
variability in OCR output is unusual for ``normal'' textual pages.
Therefore, a reject region was established to filter out all pages containing tables. These pages would need to be processed by another type of classifier, since the difficulty they present to OCR algorithms stems not from image quality but from the complexity of their contents and layout. Table 4.5 shows the confusion matrix with the addition of the Table Reject column. Pages containing tables were not processed by the classifier and were assigned to the Table Reject column.
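The bookkeeping for such a reject column can be sketched as follows. This is a minimal illustration, not the experiment's actual tally: the counts below are hypothetical placeholders (not the values in Table 4.5), and it assumes one common convention in which the error rate is taken over accepted (non-rejected) pages while the reject rate is taken over all pages.

```python
# Sketch of a confusion matrix with a reject column: pages containing
# tables bypass the classifier and are counted as rejected; error and
# reject rates are then computed from the remaining pages.

def summarize(matrix, rejected):
    """matrix: {(true_label, predicted_label): count} for classified pages.
    rejected: number of pages routed to the Table Reject column."""
    classified = sum(matrix.values())
    total = classified + rejected
    # Misclassifications are off-diagonal entries of the matrix.
    errors = sum(n for (true, pred), n in matrix.items() if true != pred)
    error_rate = errors / classified   # rate over accepted pages only
    reject_rate = rejected / total     # fraction of all pages rejected
    return error_rate, reject_rate

# Hypothetical counts, NOT the experiment's numbers:
matrix = {("Good", "Good"): 80, ("Good", "Bad"): 5,
          ("Bad", "Bad"): 10, ("Bad", "Good"): 0}
err, rej = summarize(matrix, rejected=11)
```

Under this convention, rejecting pages lowers the error rate at the cost of a nonzero reject rate, which is exactly the trade-off the Table Reject column introduces.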
The Error and Reject Rates are now:
After examining the results broken down by the number of connected components on the page (see Table 4.6), it is clear that a reject zone must be implemented for any page with 200 connected components or fewer.
The rationale behind this decision is that the classifier is based on measured ratios. Pages with a low number of connected components are not ``stable'' enough to yield credible ratios, since a small variation can produce a very high (or very low) ratio. A cutoff number is therefore required to make the classifier robust.
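The instability argument can be made concrete with a toy example. The ratio below (``suspect'' components over total components) is a hypothetical stand-in for the classifier's actual features, and the 200-component cutoff is the one stated in the text:

```python
# Why low connected-component counts make ratio features unstable:
# the same absolute change in counts moves the ratio far more on a
# sparse page than on a dense one. (The ratio here is illustrative,
# not one of the classifier's actual measured features.)

def ratio(suspect, total):
    return suspect / total

# Sparse page: two extra suspect components double the ratio.
small_a = ratio(2, 40)     # 0.05
small_b = ratio(4, 40)     # 0.10

# Dense page: the same change barely moves the ratio.
large_a = ratio(50, 1000)  # 0.050
large_b = ratio(52, 1000)  # 0.052

MIN_COMPONENTS = 200  # cutoff from the text

def accept_for_classification(n_components):
    # Reject pages too sparse to yield stable ratios.
    return n_components > MIN_COMPONENTS
```

With only 40 components the ratio jumps from 0.05 to 0.10, while at 1000 components the same perturbation is negligible, which is why a hard cutoff makes the classifier robust.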
The results obtained after applying this new filter are shown in
Table 4.7, where no ``B->G''
misclassifications exist.
These last results produce the following Error / Reject Rates:
In this case, however, there are no ``B->G''
misclassifications, which was a desired goal with regard to quality
control.
There are a considerable number of ``G->B''
misclassifications. After evaluating the misclassified pages, as well
as the triggered rules that produced the misclassifications, the
following considerations are in order: