A Novel Feature Extraction and Classification of Degraded Historical Documents
Abstract
Text recognition of historical documents poses several challenges because of the degraded quality of the documents. Aging and writing style are the major causes of this degradation. Documents may also exhibit faded ink, ink stains, uneven spacing between text lines, overlapping text lines or characters, and damaged characters or pages. To preserve our cultural heritage, we work on digitizing historical documents. The main objective is to recognize historical document images by applying maximally stable extremal regions (MSER) feature extraction, stroke width filtering, and a random forest (RF) classifier. The proposed method filters out noise pixels with a thresholding operation to obtain a clean image. Each character is labelled using the connected-component labelling method. After labelling, a bounding box technique is applied to segment each character. Non-text regions are removed using MSER and the stroke width image. Each segmented character image is resized, and its histogram of oriented gradients (HOG) feature descriptors are extracted and stored as a base. The RF classifier is applied to the feature descriptors to classify and recognize the historical document images. The average RF classification accuracy across the different datasets is 96.2%. The proposed method also performs better than the existing method in terms of recall, precision, and F-measure. The average values of recall, precision, and F-measure are 0.966, 0.996, and 0.979, respectively.
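The segmentation stage described in the abstract (thresholding, connected-component labelling, and bounding boxes) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed threshold of 128, the use of 8-connectivity, and the toy image are all assumptions made for illustration, and the MSER, stroke width, HOG, and RF stages are omitted.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Mark pixels darker than `threshold` as ink (1), the rest as background (0).

    The value 128 is an assumed threshold, not one taken from the paper.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def label_components(binary):
    """Label 8-connected ink regions with a breadth-first flood fill.

    Returns (labels, count); labels start at 1 and 0 means background.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if binary[r][c] and not labels[r][c]:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    # Visit the 8 neighbours (assumed connectivity).
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < height and 0 <= nx < width
                                    and binary[ny][nx] and not labels[ny][nx]):
                                labels[ny][nx] = count
                                queue.append((ny, nx))
    return labels, count


def bounding_boxes(labels):
    """Return {label: (top, left, bottom, right)} with inclusive coordinates."""
    boxes = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            if not lab:
                continue
            if lab in boxes:
                top, left, bottom, right = boxes[lab]
                boxes[lab] = (min(top, r), min(left, c), max(bottom, r), max(right, c))
            else:
                boxes[lab] = (r, c, r, c)
    return boxes


# Toy grayscale "page" with two dark strokes on a light background.
page = [
    [255, 255, 255, 255, 255, 255, 255],
    [255,  20,  30, 255, 255,  10, 255],
    [255,  25, 255, 255, 255,  15, 255],
    [255,  20,  30, 255, 255,  10, 255],
    [255, 255, 255, 255, 255, 255, 255],
]
labels, count = label_components(binarize(page))
print(count, bounding_boxes(labels))
```

In a fuller pipeline, each bounding box would be cropped and resized before feature extraction. A fixed global threshold is the simplest choice here, and degraded pages with uneven backgrounds may call for an adaptive one.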