Machine Learning of Morphological Stemming for Tamil Text Classification

K. Rajan, N. Rajkumar

Abstract

This paper proposes a supervised machine learning methodology for segmenting the morphological variants of words into stem and suffix for Tamil text classification. This can be considered a word stemming problem in which the stem boundaries are identified within the word forms. Morphological segmentation is a significant issue in computational linguistics, as it supports other crucial tasks such as stemming, syntactic parsing, part-of-speech tagging and machine translation. Identifying the presence of a boundary between stem and suffix is a classification problem: the sequences of characters in the left and right context of a candidate position are given as input. Artificial neural networks are generally employed for problems of this kind, for which there is no promising and efficient algorithmic solution. Accordingly, in this paper we present the results obtained by applying an artificial neural network trained with the backpropagation algorithm to segmentation. For this segmentation process, words marked with stem boundaries are used. From the boundary-marked words, 20,000 samples are created; 15,000 feature vectors are used for training and the remaining 5,000 for testing, using 4-fold cross-validation. Each sample is represented as a 25-bit feature vector, and the boundary information is provided as the output. The implementation of supervised learning for morphological segmentation presented in this paper is useful for Tamil word segmentation and stemming in text classification tasks.
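
To make the sample construction and the data split concrete, the following is a minimal Python sketch under stated assumptions, not the implementation used in this paper. The function names, the `+` boundary marker, the `#` padding symbol, the context window of two characters, and the English placeholder word are all hypothetical choices made for illustration. The 25-bit encoding of each sample is not reproduced here, since the abstract does not specify it.

```python
from typing import Iterator, List, Tuple

PAD = "#"  # hypothetical padding symbol for contexts that run past the word edge


def boundary_samples(marked_word: str, window: int = 2,
                     marker: str = "+") -> List[Tuple[str, str, int]]:
    """Turn one boundary-marked word into (left, right, label) samples.

    One sample is produced for every gap between two adjacent characters.
    The label is 1 if the stem/suffix boundary falls in that gap, else 0.
    """
    stem, found, suffix = marked_word.partition(marker)
    word = stem + suffix
    boundary = len(stem) if found else -1  # -1: no boundary marked
    padded = PAD * window + word + PAD * window
    samples = []
    for gap in range(1, len(word)):
        left = padded[gap:gap + window]                 # characters before the gap
        right = padded[gap + window:gap + 2 * window]   # characters after the gap
        samples.append((left, right, 1 if gap == boundary else 0))
    return samples


def k_fold_indices(n: int, k: int = 4) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous folds.

    If n is not divisible by k, the leftover indices stay in the training part.
    Samples would normally be shuffled before splitting.
    """
    fold = n // k
    for f in range(k):
        lo, hi = f * fold, (f + 1) * fold
        test = list(range(lo, hi))
        train = [i for i in range(n) if i < lo or i >= hi]
        yield train, test


if __name__ == "__main__":
    # English placeholder word, used only to show the shape of the samples.
    for sample in boundary_samples("walk+ing"):
        print(sample)
    for train, test in k_fold_indices(20000, k=4):
        print(len(train), len(test))
```

With 20,000 samples and four folds, each fold in this sketch holds out 5,000 samples for testing and keeps 15,000 for training, which matches the split described above.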