Secured Digital Data Extraction of Enterprise Invoice Images Using Fuzzy Matching Techniques

R. Priyadarshini, Swetha, S. B, R. Kiran, Mohammed Azeem.M. A,  N. Rajendran

Abstract

Natural Language Processing (NLP) is one of the components of machine learning. Topic modeling is a subcomponent of information retrieval, which is itself a broad area of research in NLP. Image processing is the extraction of meaningful information, mainly from digitally stored images, by means of a number of processing techniques. Each technique, however, may be useful only for a small range of tasks, since no known method of image analysis is generic enough to cover the wide range of tasks that human image analysis can. The existing system for retrieving useful information from invoices is unstable and involves many manual operations, which makes it difficult to manage and maintain. For example, handling the invoices that employees submit for reimbursement and entering the information manually is tedious and time-consuming. To overcome these issues, this project proposes a Secured Digital Data Extraction of Enterprise Invoices. To resolve the existing discrepancies, the proposed system applies optical character recognition (OCR) and line detection, as image enhancement techniques, to categorize invoices. To overcome the instability and poor data retrieval rate of the existing system, fuzzy matching techniques and keyword-based searching are proposed. The generation and integration phases are combined to produce the summarized legal text documents, which avoids instability and performance issues. Data is segmented efficiently from the enhanced image using line detection and a keyword search engine algorithm.
The line detection algorithms, the HOCR Box method and the HOCR Line method, use lattice-point edge detection to extract text from images with a bounding-box approach based on white space, letter case, and font type (sans-serif or serif). The TXT method extracts the text characters from the images and writes them to a text file as output. Keywords are defined and trained with a Long Short-Term Memory (LSTM) neural network to produce meta keywords, which are used as parameters for fuzzy matching. A Recurrent Neural Network (RNN) is trained each time on the invoice details that have been extracted to the text file. The semantic keywords are supplied as the fuzzy matching parameters: the text file is read line by line and each line is checked for the best match. The neural network is trained each time on the input and on the text generated from the invoice images, so that the matching details can be written to a CSV file. The CSV file is then converted to an Excel file for a detailed view of the invoice information. Compared with older systems, the proposed algorithm rectifies their drawbacks, and the accuracy, recall, and precision of the proposed system have increased from 95 to 98%.
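The line-by-line fuzzy matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the OCR text is already available as a list of lines, that field values follow a colon on the matched line, and that Python's standard-library `difflib.SequenceMatcher` ratio stands in for the fuzzy matching score. The function names `best_match` and `extract_fields`, the 0.75 threshold, and the sample invoice lines are all hypothetical.

```python
from difflib import SequenceMatcher


def best_match(keyword, lines):
    """Return (line_index, score, line) for the line that best matches keyword.

    Assumption: similarity is the SequenceMatcher ratio between the keyword
    and a sliding window of as many words as the keyword contains.
    """
    key = keyword.lower()
    width = max(1, len(key.split()))
    best = (-1, 0.0, "")
    for index, line in enumerate(lines):
        words = line.lower().split()
        # Slide a keyword-sized window across the words of the line.
        for start in range(max(1, len(words) - width + 1)):
            window = " ".join(words[start:start + width])
            score = SequenceMatcher(None, key, window).ratio()
            if score > best[1]:
                best = (index, score, line)
    return best


def extract_fields(keywords, lines, threshold=0.75):
    """Map each keyword to the value on its best-matching line.

    Assumption: the value is whatever follows the first colon on that line.
    Keywords whose best score falls below the threshold are skipped.
    """
    fields = {}
    for keyword in keywords:
        _, score, line = best_match(keyword, lines)
        if score >= threshold:
            value = line.split(":", 1)[1] if ":" in line else line
            fields[keyword] = value.strip()
    return fields


# Hypothetical OCR output containing typical recognition errors.
ocr_lines = [
    "ACME Traders Pvt Ltd",
    "Invoce Numbr: INV-1042",
    "Totl Amount: 1,250.00",
]
fields = extract_fields(["invoice number", "total amount"], ocr_lines)
```

The resulting dictionary could then be written out as one row of a CSV file, for instance with the standard `csv` module, before conversion to a spreadsheet.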