Optimization of Phrase Table in Statistical Machine Translation Using Linguistic Features
Phrase-based machine translation is a statistical approach to machine translation that depends heavily on the quality and size of the phrase table. During training, the phrase table accumulates both useful and useless phrase pairs extracted from a word-aligned parallel corpus. The useless, spurious, and redundant phrase pairs inflate the phrase table, which then demands more memory and wastes processor time. To fit the translation model on handheld devices, where memory is a scarce resource, some research has sought to reduce the size of the phrase table. However, most existing approaches remove useless phrase pairs using hard rules based on translation probability values. These rigid constraints rely on a single probability value and do not consider composite factors. This paper proposes a machine learning based classification framework that filters useless phrase pairs from the phrase table using multiple linguistic and statistical features. The results show that the proposed framework removes 67% of the phrase pairs while maintaining acceptable translation quality.
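The contrast between a single-probability hard rule and a multi-feature decision can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `PhrasePair` fields, the three features, the 0.1 cutoff, the hand-set weights (which stand in for a trained classifier), and the toy phrase pairs.

```python
from dataclasses import dataclass


@dataclass
class PhrasePair:
    # Hypothetical record; a real phrase table carries more scores.
    source: str
    target: str
    direct_prob: float   # p(target | source)
    inverse_prob: float  # p(source | target)


def threshold_filter(pairs, min_prob=0.1):
    """Hard rule: keep a pair only if one probability clears a fixed cutoff."""
    return [p for p in pairs if p.direct_prob >= min_prob]


def features(pair):
    """Illustrative feature vector mixing statistical and surface cues."""
    src_len = len(pair.source.split())
    tgt_len = len(pair.target.split())
    # Ratio of shorter to longer side, in [0, 1]; guards against empty phrases.
    length_ratio = min(src_len, tgt_len) / max(src_len, tgt_len, 1)
    return [pair.direct_prob, pair.inverse_prob, length_ratio]


# Hand-set values standing in for parameters a classifier would learn
# from labeled useful/useless phrase pairs (assumption, not from the paper).
WEIGHTS = [2.0, 2.0, 1.0]
BIAS = -1.5


def classifier_filter(pairs):
    """Keep a pair when the linear decision score over all features is positive."""
    kept = []
    for p in pairs:
        score = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(p)))
        if score > 0:
            kept.append(p)
    return kept


# Toy data, invented for this sketch.
pairs = [
    PhrasePair("the house", "la maison", 0.60, 0.55),
    PhrasePair("the", "la maison de la", 0.12, 0.01),
    PhrasePair("in the garden", "dans le jardin", 0.08, 0.70),
]

print([p.target for p in threshold_filter(pairs)])
print([p.target for p in classifier_filter(pairs)])
```

On this toy data the hard rule keeps the badly aligned pair ("the" / "la maison de la") because its direct probability just clears the cutoff, and drops "in the garden" / "dans le jardin" because it falls just below. The composite score does the reverse, since the inverse probability and length ratio also contribute to the decision.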