The 15 ``B->G'' misclassifications (Table 4.2)
were examined carefully. The page images did not present substantial
degradation, confirming the classifier's
results. Figure 4.2 shows excerpts from some of these
images, where their quality can be appreciated.
Figure 4.2: Excerpts from B->G misclassified images (magnified)
All but 4 of these 15 ``problematic'' pages contained tables. The OCR output for many of these tables was illegible and generally useless. Figures 4.3 and 4.4 show a clean table image and its associated OCR output. Tables pose special problems for OCR devices.
Figure 4.3: Clean table image
The classifier labeled all 11 of these table-containing pages ``Good''
because, from an image-defects point of view, it could not find enough
evidence to assign them a ``Bad'' label. The contents of these
pages, however, suggest that table-generated OCR errors are special
and, therefore, unrelated to image-generated OCR errors. After
evaluating all 11 ``B->G'' misclassified ``table
pages'', the following observations are in order:
It is important to make clear that none of these three observations
depends on page quality. Furthermore, after visually inspecting
the images, tables that presented poor image quality were in general
correctly flagged as ``Bad'' by the classifier, and their OCR output
was subject to the normal image-quality-related errors (in addition to
the special problems posed by tables, mentioned above). Similarly,
pages labeled ``Good'' by the classifier were visually inspected
and found to indeed have high image quality. Some of these pages were
considered ``Bad'' in the experiment because their low OCR accuracy
stems from the special characteristics of tables mentioned above and
not from image defects. Furthermore, many of the ``B->G''
pages would not have been misclassified if only a selected subset of
the OCR devices used in the experiment had been employed. Such
variability in OCR output is unusual for ``normal'' textual pages.
Therefore, a reject region was established to filter out all pages containing tables. These pages would need to be processed by another type of classifier, since the difficulty they present to OCR algorithms stems not from image quality but from the complexity of their contents and layout. Table 4.5 shows the confusion matrix with the addition of the Table Reject column. Pages containing tables were not processed by the classifier and were assigned to the Table Reject column.
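The bookkeeping for such a reject column can be sketched as follows. This is a minimal illustration, not the experiment's actual tally: the counts below are hypothetical placeholders (not the values in Table 4.5), and it assumes one common convention in which the error rate is taken over accepted (non-rejected) pages while the reject rate is taken over all pages.

```python
# Sketch of a confusion matrix with a reject column: pages containing
# tables bypass the classifier and are counted as rejected; error and
# reject rates are then computed from the remaining pages.

def summarize(matrix, rejected):
    """matrix: {(true_label, predicted_label): count} for classified pages.
    rejected: number of pages routed to the Table Reject column."""
    classified = sum(matrix.values())
    total = classified + rejected
    # Misclassifications are off-diagonal entries of the matrix.
    errors = sum(n for (true, pred), n in matrix.items() if true != pred)
    error_rate = errors / classified   # rate over accepted pages only
    reject_rate = rejected / total     # fraction of all pages rejected
    return error_rate, reject_rate

# Hypothetical counts, NOT the experiment's numbers:
matrix = {("Good", "Good"): 80, ("Good", "Bad"): 5,
          ("Bad", "Bad"): 10, ("Bad", "Good"): 0}
err, rej = summarize(matrix, rejected=11)
```

Under this convention, rejecting pages lowers the error rate at the cost of a nonzero reject rate, which is exactly the trade-off the Table Reject column introduces.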
The Error and Reject Rates are now:
After examining the results broken down by the number of connected components on the page (see Table 4.6), it is clear that a reject zone must be implemented for any page with 200 connected components or fewer.
The rationale behind this decision is that the classifier is based on measured ratios. Pages with a low number of connected components are not ``stable'' enough to yield credible ratios, since a small variation can produce a very high (or very low) ratio. A cutoff number is therefore required to make the classifier robust.
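The instability argument can be made concrete with a toy example. The ratio below (``suspect'' components over total components) is a hypothetical stand-in for the classifier's actual features, and the 200-component cutoff is the one stated in the text:

```python
# Why low connected-component counts make ratio features unstable:
# the same absolute change in counts moves the ratio far more on a
# sparse page than on a dense one. (The ratio here is illustrative,
# not one of the classifier's actual measured features.)

def ratio(suspect, total):
    return suspect / total

# Sparse page: two extra suspect components double the ratio.
small_a = ratio(2, 40)     # 0.05
small_b = ratio(4, 40)     # 0.10

# Dense page: the same change barely moves the ratio.
large_a = ratio(50, 1000)  # 0.050
large_b = ratio(52, 1000)  # 0.052

MIN_COMPONENTS = 200  # cutoff from the text

def accept_for_classification(n_components):
    # Reject pages too sparse to yield stable ratios.
    return n_components > MIN_COMPONENTS
```

With only 40 components the ratio jumps from 0.05 to 0.10, while at 1000 components the same perturbation is negligible, which is why a hard cutoff makes the classifier robust.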
The results obtained after applying this new filter are shown in
Table 4.7, where no ``B->G''
misclassifications exist.
These last results produce the following Error / Reject Rates:
In this case, however, there are no ``B->G''
misclassifications, which was a desired goal with regard to quality
control.
There are a considerable number of ``G->B''
misclassifications. After evaluating the misclassified pages, as well
as the triggered rules that produced the misclassifications, the
following considerations are in order: