Nima Hemmati

master

 

Persian to English Transliteration

Transliteration is the task of converting words in a source language into their approximate phonetic or spelling equivalents in a target language [43]. Transliteration and translation are different tasks. For instance, the Persian word "کتاب" (ketâb) is translated into English as "book", whereas its transliteration in English is "ketâb". Machine transliteration has proven to be an important and useful research area in natural language processing (NLP). One of its main uses is in machine translation (MT). Despite the large amount of data available to them, MT systems still suffer from out-of-vocabulary (OOV) words. OOV words are mostly named entities, such as person, company, and place names, as well as technical terms and foreign words that usually do not appear in dictionaries. Transcribing the source-language word and using it directly in the target language is one solution for OOV words. Beyond MT, machine transliteration supports many other challenging NLP tasks: 1) cross-lingual information retrieval; 2) real-time translation of emails, blogs, etc.; 3) multilingual chat applications; 4) cross-lingual question answering systems; 5) text-to-speech (TTS) systems.

In this thesis, we induce a statistical machine translation (SMT) based model from a Persian-English parallel corpus to mitigate the OOV problem. As a first step, we used a simple phrase-based SMT model. Error analysis indicated that Ezafe markers (linking vowels that are usually left unwritten in Persian script) contribute considerably to the errors. We therefore used a conditional random field (CRF) based Ezafe recognition system to determine the Ezafe markers in Persian text. Then, to handle OOV words, we trained grapheme-to-phoneme (G2P) conversion and word lattice models and integrated them into the SMT system.
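The OOV fallback described above can be pictured with a small sketch. It is an illustration under stated assumptions only: `TOY_LEXICON`, `TOY_CHAR_MAP`, and both function names are invented for this example, the character table is not the Dabire scheme, and a plain dictionary lookup stands in for the phrase-based SMT model.

```python
# Toy illustration only: the lexicon and the character table below are invented
# for this example. They are not the thesis's models and not the Dabire scheme.
TOY_LEXICON = {"کتاب": "book", "خوب": "good"}

# Hypothetical grapheme-to-Latin table covering just the letters used below.
TOY_CHAR_MAP = {"ت": "t", "ه": "h", "ر": "r", "ا": "â", "ن": "n"}


def transliterate(word):
    """Map each grapheme to a Latin symbol; unknown graphemes become '?'."""
    return "".join(TOY_CHAR_MAP.get(ch, "?") for ch in word)


def translate_with_oov_fallback(tokens):
    """Translate known tokens via the lexicon; transliterate OOV tokens."""
    output = []
    for token in tokens:
        if token in TOY_LEXICON:
            output.append(TOY_LEXICON[token])      # in-vocabulary: translate
        else:
            output.append(transliterate(token))    # OOV: fall back to transliteration
    return output


result = translate_with_oov_fallback(["کتاب", "تهران"])
print(result)
```

Because Persian script usually leaves short vowels unwritten, this naive character mapping renders the place name "تهران" (Tehrân) without its short vowel "e". Gaps of this kind are what a G2P model is meant to close.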
A Persian word can be written in several different ways using the English alphabet, so a standard writing system is needed to avoid ambiguity. We therefore used Dabire, a romanized transcription scheme, and created our own Persian-Dabire parallel corpus. The final performance on the test corpus shows that our system achieves results comparable to those of other state-of-the-art systems.

Keywords: Transliteration, Machine Transliteration, Dabire, Statistical Machine Translation, Grapheme to Phoneme, Word Lattice
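To make the spelling ambiguity and the word-lattice idea concrete, the following is a minimal sketch of a lattice of candidate romanizations for one word, with the cheapest path chosen by dynamic programming. Everything in it is an assumption made for illustration: the edge list `TOY_EDGES`, its costs, and the function `best_path` are hypothetical and do not reproduce the thesis's lattice construction or its integration into the SMT system.

```python
# Toy lattice: each edge is (src, dst, label, cost). Node numbers are assumed to
# be topologically ordered (src < dst). Labels and costs are invented.
TOY_EDGES = [
    (0, 1, "k", 0.0),
    (1, 2, "e", 0.4),   # candidate short vowels that the script does not write
    (1, 2, "a", 1.2),
    (1, 2, "", 2.0),    # or no vowel at all
    (2, 3, "t", 0.0),
    (3, 4, "â", 0.0),
    (4, 5, "b", 0.0),
]


def best_path(edges, start, end):
    """Return (cost, string) for the cheapest start-to-end path, or None."""
    best = {start: (0.0, "")}
    # Sorting by source node processes every edge into a node before any
    # edge out of it, which is valid because src < dst for all edges.
    for src, dst, label, cost in sorted(edges):
        if src not in best:
            continue  # source node is not reachable from start
        candidate = (best[src][0] + cost, best[src][1] + label)
        if dst not in best or candidate[0] < best[dst][0]:
            best[dst] = candidate
    return best.get(end)


cost, text = best_path(TOY_EDGES, 0, 5)
print(cost, text)
```

In this toy lattice the alternatives differ only in the unwritten short vowel after "k". A real system would derive both the alternatives and their costs from trained models rather than from hand-set numbers.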