https://doi.org/10.31449/inf.v47i6.4395    Informatica 47 (2023) 191–202

Baseline Transliteration Corpus for Improved English-Amharic Machine Translation

Yohannes Biadgligne 1, Kamel Smaili 2
1 Sudan University of Science and Technology (SUST) and Bahir Dar Institute of Technology (BIT), Khartoum and Bahir Dar
2 Loria - Université Lorraine, France
E-mail: yohannesb2001@gmail.com, kamel.smaili@loria.fr

Keywords: machine translation, Amharic language, transliteration, augmentation, NMT, RNN, GRU, transformer

Received: September 19, 2022

Machine translation (MT) between English and Amharic is one of the least studied and, performance-wise, least successful topics in the MT field. We therefore propose to apply corpus transliteration and augmentation techniques in this study to address this issue and improve MT performance for this language pair. This paper presents the creation, the augmentation, and the use of an Amharic-to-English transliteration corpus for NMT experiments. The created corpus contains a total of 450,608 parallel sentences before preprocessing and is used, after preprocessing, to train three different NMT architectures. These models are built using Recurrent Neural Networks with an attention mechanism (RNN), Gated Recurrent Units (GRUs), and Transformers. For the Transformer-based experiments, three different Transformer models with different hyperparameters are created. Compared to previous works, the BLEU scores of all NMT models used in this study are improved. One of the three Transformer models, in particular, achieves the highest BLEU score ever recorded for this language pair.

Povzetek: This research addresses the improvement of machine translation (MT) between English and Amharic, one of the least studied and least successful areas in MT. The use of corpus transliteration and augmentation techniques is proposed. A corpus for NMT experiments comprising 450,608 parallel sentences was created.

1 Introduction

In today's age of technology and social media, it is increasingly common to incorporate foreign words into one's native tongue and to compose in one language using the script of another. English is the most widely used language in this regard [1]. This can be attributed to many reasons, one of which is the prevalence of the 'QWERTY' keyboard layout in laptops, smartphones, and even mechanical typewriters, especially in developing countries. Thus, many people who do not speak English prefer to compose their ideas using English script across multiple messaging platforms. This writing method is known as transliteration [2, 3].

In the 1990s, NLP researchers became interested in creating machines for transliteration purposes to support other research areas. This was the first time the concept of machine transliteration was introduced. Machine transliteration is a subfield of MT and cross-language information retrieval (CLIR). Its primary goal is to use computers to convert a text from one language script (the source language) to another language script (the target language) while preserving the pronunciation as much as possible. In technical terms, it is concerned with accurately representing the graphemes of one language script using the script of another language [4].
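To make the grapheme-level view concrete, the sketch below shows a minimal grapheme-based mapping in Python. The character map and example word are illustrative only; they are not the mapping used later in this paper.

```python
# Minimal, illustrative grapheme-based transliteration: each source-script
# grapheme is rewritten with a target-script (Latin) string. The tiny map
# below is a hypothetical excerpt, not the full table used in this work.
CHAR_MAP = {
    "ኢ": "ī", "ት": "t", "ዮ": "yo", "ጵ": "p'", "ያ": "ya",  # spells "Ethiopia"
}

def transliterate(word: str, char_map: dict[str, str]) -> str:
    """Replace every known grapheme; keep unknown characters unchanged."""
    return "".join(char_map.get(ch, ch) for ch in word)

if __name__ == "__main__":
    print(transliterate("ኢትዮጵያ", CHAR_MAP))  # -> ītyop'ya
```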
The literature on MT suggests that transliteration can be used with MT systems to reduce translation errors and improve precision when translating names (named entities), technical terms, and loan (borrowed) words [5, 6, 7], particularly for languages with limited resources (e.g., small bilingual corpora), such as Amharic, because learning all of the words of a given language from a small amount of bilingual training data is impossible [8, 9, 10]. Finch et al. [11] carried out a large-scale, real-world evaluation of the use of automatic transliteration in an MT system and demonstrated that using a transliteration system can improve MT quality when translating unknown words. As a result, machine transliteration has become a promising application for MT. Table 1 shows the distinction between translation and transliteration for the languages under consideration (Amharic and English).

Table 1: Example of Amharic to English translation and transliteration

Amharic   | Translation | Transliteration
ኢትዮጵያ | Ethiopia    | ītiyop’iya
አፍሪካ   | Africa      | āfirīka
ውስጥ ናት | is in       | wisit’i nati

Amharic (አማርኛ, /əmərɨgnə/), the main language of Ethiopia, has its own script and is the second most widely spoken Semitic language after Arabic. The Amharic script was originally derived from Ge'ez (ግዕዝ, /gə'əzzə/). Although it has disappeared as a colloquial language, Ge'ez remains the main language used for prayer and ritual performance, and the main teaching language, in the Ethiopian Orthodox Church [12]. Amharic uses a slightly modified version of the Ge'ez alphabet. It consists of 34 basic characters, each of which has seven forms depending on which vowel is pronounced in the syllable. Even though they are no longer widely used, Amharic also inherits all of the Ge'ez numeric character sets [13].

2 Related work

Machine transliteration is rarely an end goal by itself; it is usually used as part of other NLP tasks (such as CLIR, question answering, or MT). In light of its importance in these fields, a number of transliteration mechanisms have been proposed for non-English languages, including Russian, Chinese, Korean, Arabic, Persian, and Indian languages [14]. These mechanisms generally fall into three broad categories: linguistic (rule-based) approaches, statistical approaches, and deep learning approaches [15].

The linguistic approach uses hand-crafted rules based on pattern matching, which requires linguistic analysis to formulate the rules. This approach demands a thorough understanding of the language under consideration. Early attempts used this method to construct baseline transliteration corpora, and it is still used as a starting point to acquire transliteration corpora for low-resource languages [16]. Deep and Goyal [17] proposed a Punjabi-to-English transliteration system that uses a linguistic (rule-based) approach. In the proposed transliteration scheme, a grapheme-based method is used to model the transliteration problem, and it achieves an accuracy of 93.22% when transliterating common names. A similar transliteration system was developed by Goyal and Lehal [18] by implementing fifty complex rules. Their system was found to give about 98% accuracy for transliterating proper names, city names, country names, subject-related technical terms, etc.

Various transliteration systems were proposed during the Named Entities Workshop (NEWS) evaluation campaigns between 2009 and 2018 [19].
During these campaigns, transliteration was performed from English into various languages with various writing systems. As a result of this workshop series, many advances have been made in methodologies for transliterating proper nouns. Several approaches have been developed, including grapheme-to-phoneme conversion [20, 21], statistical approaches analogous to statistical machine translation [16, 22], and neural networks such as sequence-to-sequence models and Long Short-Term Memory (LSTM) networks [23, 24, 25, 26, 27].

The three transliteration approaches discussed previously can be based on grapheme, phoneme, hybrid, or correspondence transliteration models. (A grapheme is a letter or set of letters that represents the sound, i.e. phoneme, of a word; a phoneme is one of the smallest speech units that distinguishes one word from another.)

- Grapheme-based models: directly convert source language graphemes into target language graphemes without requiring phonetic knowledge of the source language words.
- Phoneme-based models: use source language phonemes as a pivot when producing target language graphemes from source language graphemes.
- Hybrid and correspondence-based models: use both source language graphemes and phonemes.

Generally, statistical and neural network techniques based on large parallel transliteration corpora work well for resource-rich languages, but low-resource languages do not have the luxury of such resources. For such languages, rule-based transliteration is the only viable option [16].

2.1 Amharic transliteration

In our literature review, we found two cases where Amharic was studied for transliteration tasks. The first attempt was made by Tadele Tedla [28]. His objective was to develop a framework to convert ASCII-transliterated Amharic text back to the original Amharic text. In the transliteration of three random test data sets, the model achieves 97.7, 99.7, and 98.4 percent accuracy, respectively. The first test set consists of an ASCII-transliterated Amharic word list of 32,482 words. The second test set is a transliterated poem with 1,277 words, and the third is a recipe for injera, a common local food in Ethiopia, with 123 transliterated Amharic words.

Gezmu et al. [29] made the second attempt at Amharic-to-English machine transliteration. In their work, they used machine transliteration as a tool (to facilitate vocabulary sharing) to improve the performance of Amharic-English MT. Despite claiming to have created an Amharic-English transliteration corpus for named entities and borrowed words, they did not make it publicly available.

Based on this review of the literature, we believe that our attempt is the first to create a large Amharic-English transliteration corpus for English-Amharic NMT. Table 2 summarizes the related works discussed in this section.

Table 2: Summary of related works

Author, Year | Language pair | Model / approach | Objective | Results
Goyal, Vishal, and Gurpreet Singh Lehal (2009) | Hindi to Punjabi | G2P, rule-based | For MT | 98% accuracy
Deep, Kamal, and Vishal Goyal (2011) | Punjabi to English | G2P, rule-based | For MT and CLIR | 93.22% accuracy
Laurent, Antoine, Paul Deléglise, and Sylvain Meignier (2009) | French to French | G2P, SMT | For ASR system | Comparable to the dictionary look-up strategy
Finch, Andrew, and Eiichiro Sumita (2010) | English to Thai, Hindi, Tamil, Kannada, Japanese, Bangla | G2P, PBSMT and joint multigram model | For MT and CLIR | Better performance than previous works
Yao, Kaisheng, and Geoffrey Zweig (2015) | US English | G2P, Bi-LSTM | For MT and image captioning | Outperforms previous SOTA models
Rao, Kanishka, et al. (2015) | US English | G2P, LSTM-RNN | Not explicitly mentioned | Improvement over previous similar works
Shao, Yan, and Joakim Nivre (2016) | English to Chinese and Chinese to English | Neural networks (CNN and RNN) | Not explicitly mentioned | Competitive results with SOTA models at the time
Thu, Ye Kyaw, et al. (2016) | Myanmar to English | G2P, PBSMT, CRF, S-Arrow, JSM | For pronunciation dictionary | CRF and PBSMT achieved the best results
Tedla, Tadele (2015) | ASCII-transliterated Amharic to original Amharic | G2P, key-map dictionary | Not explicitly mentioned | 97.7% to 99.7% accuracy
Gezmu, Andargachew Mekonnen, Andreas Nürnberger, and Tesfaye Bayu Bati (2021) | Amharic to English | G2P, rule-based | For MT | Better results than previous works

3 Motivation

Developing a reliable English-Amharic MT system remains a challenge. A scarcity of resources and the absence of well-organized MT research projects are the two major obstacles to overcoming this challenge. Our search reveals that the majority, if not all, of the research works on English-Amharic MT are done by independent individuals and are disjointed. The BLEU scores reported for this language pair are, therefore, not indicative of high-quality translation, according to a general interpretation of the BLEU score.
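For reference, the BLEU scores reported in this paper are corpus-level scores. The snippet below shows one common way to obtain such a score with the sacrebleu package; the example sentences are made up and are not taken from the corpus used here.

```python
# Corpus-level BLEU with sacrebleu (illustrative sentences only).
import sacrebleu

hypotheses = ["the corpus was created from news and legal text"]
references = [["the corpus was built from news and legal texts"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```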
Thus, this study aims to enhance English-Amharic MT performance by incorporating transliteration as a tool. To achieve this goal, we created an Amharic-English transliteration corpus from a previously collected English-Amharic MT corpus [30, 31] and used it for English-Amharic NMT experiments. This is the first baseline corpus for this language pair, and it will be made available to MT and IR researchers.

4 Experimental set-up

4.1 Corpus preparation

The objective of this study is to improve the performance of English-Amharic MT by using a transliterated and augmented corpus. However, the data required for training the NMT models is not available. As a result, the previously gathered English-Amharic translation corpus is used to generate an Amharic-English transliteration corpus. This section explains the methods and techniques used to create this corpus, as well as the NMT experiments performed with it.

4.1.1 Acquisition of the previously collected translation corpus

The freely available English-Amharic translation corpus was obtained from the GitHub repositories https://github.com/yohannesb/English-Amharic-origional-corpus and https://github.com/yohannesb/English-Amharic-Augmented-corpus [30]. This corpus was compiled from the religious, legal, and news domains and contains 225,304 English-Amharic parallel sentences.

4.1.2 Pre-transliteration preprocessing

This step is completed before the transliteration process begins. It is performed on the previously acquired original Amharic translation corpus. Normalization of homophone characters, removal of punctuation marks, and conversion of Amharic numerals to Arabic numerals are all carried out. After these preprocessing tasks are completed, the corpus is divided into 25 parts and distributed to data collectors. These data collectors use Google Translate to transliterate the Amharic sentences into English script and then collect the transliterated sentences by copying and pasting them into a text file.

4.1.3 Transliterating the acquired corpus

Two different steps are followed to complete this task successfully.

1. Performing transliteration: This process was carried out using Google's online translation tool (https://translate.google.com/). Although its primary goal is translation, Google Translate can generate text transliterations as part of the translation process when the two languages use distinct scripts. The main task completed at this stage, as shown in Figure 1, was transliterating Amharic sentences to English using Google Translate and collecting the transliterated sentences.

Figure 1: Snapshot taken from Google Translate.

In order to transliterate and compile a total of 225,304 Amharic sentences, 25 data collectors (computer science students) participated. The entire process of transliterating and normalizing these Amharic sentences took 60 days, with each data collector having a daily throughput of about 150 sentences. Prior to the transliteration task, each data collector was given brief training and guidance to improve the quality and consistency of the transliteration process.

2. Normalizing the transliteration corpus: After the transliteration corpus was collected, the next task was corpus normalization. The objective of this task was to make the transliterations of Amharic loan words and named entities (NEs) as close as possible to the spelling of the corresponding English words, so that they become useful for MT purposes.
To assist this manual normalization process, truecasing is first carried out using Moses' built-in truecaser script. Because Amharic has a Subject-Object-Verb (SOV) grammatical structure and NEs are therefore likely to appear at the beginning of a sentence, truecasing allowed us to capitalize the first letter of the majority of NEs. This reduces the amount of work required to locate and correct NEs that are transliterated differently from their English version.

Table 3 contains examples of transliterations produced by Google Translate and their normalized forms. The table also includes the Levenshtein edit distance [32] computed between the English translation and the generated and normalized transliterations. Computing the Levenshtein edit distance allows us to choose the transliterations closest to the English translation (a short code sketch of this computation follows at the end of this step).

Table 3: Examples of Amharic to English transliteration using Google Translate and the normalized form of the transliteration. The last three columns give the Levenshtein edit distance between the English translation and, respectively, the Google transliteration, the normalized form, and the Amharic text.

Amharic text | English translation | Google transliteration | Normalized form | Dist. (Google) | Dist. (normalized) | Dist. (Amharic)
ዳንኤል   | daniel    | dani’ēli     | danēl      | 3 | 2 | 6
ሞሐመድ  | mohamed   | moḥāmedi     | mohāmed    | 2 | 0 | 7
አይሻ    | ayisha    | āyisha       | āysha      | 1 | 1 | 6
ማርታ    | marta     | marita       | marta      | 1 | 0 | 5
ቤተልሔም | bethlehem | betelihemi   | betelhem   | 3 | 2 | 9
ኢትዮጵያ | ethiopia  | ītiyop’iya   | ityop’ya   | 4 | 4 | 8
ኮምፒዩተር | computer  | komipīyuteri | kompīyuter | 5 | 3 | 8

As the table shows, most of the differences between the English translation and the transliterations generated by Google Translate occur in the representation of the sixth form of Amharic characters. For instance, the name Daniel (ዳንኤል) is spelled Dani’ēli by Google Translate, whereas its correct English spelling (the English translation) is Daniel. The discrepancy arises in writing the sixth form of Amharic characters: in this example, Google Translate uses ni and li to represent ን and ል, respectively. So, to bring the transliterated loan words and named entities in the corpus closer to the English words, these characters are normalized to n (for ን) and l (for ል). This normalization is done for all sixth-form characters of Amharic.
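Only a plain edit-distance routine is needed for the selection step described above. The following self-contained sketch is our own (not the authors' script) and reproduces the first row of Table 3.

```python
# Plain Levenshtein edit distance (dynamic programming), used here to pick,
# for each named entity, the transliteration closest to the English spelling.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

english = "daniel"
candidates = {"google": "dani'ēli", "normalized": "danēl"}
for name, cand in candidates.items():
    print(name, levenshtein(english, cand))   # google: 3, normalized: 2
```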
The transliteration character map used in this work is shown in Table 4, which is a modified version of the United Nations romanization system for geographical names (BGN/PCGN 1967 System) approved for Amharic-to-English transliteration [33]. In this standard, the sixth form of the Amharic characters actually has two optional representations. Overall, according to the Levenshtein edit distance, the normalized form of the Google transliteration is closer to the English translation.

Table 4: Amharic to English transliteration character map (romanized values for the seven forms of each of the 34 base characters; the sixth form has two optional representations, with and without a final i).

No. | Base | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th
1  | ሀ | hā  | hu  | hī  | ha  | hē  | h / hi   | ho
2  | ለ | le  | lu  | lī  | la  | lē  | l / li   | lo
3  | ሐ | hā  | hu  | hī  | ha  | hē  | h / hi   | ho
4  | መ | me  | mu  | mī  | ma  | mē  | m / mi   | mo
5  | ሠ | še  | šu  | šī  | ša  | šē  | š / ši   | šo
6  | ረ | re  | ru  | rī  | ra  | rē  | r / ri   | ro
7  | ሰ | se  | su  | sī  | sa  | sē  | s / si   | so
8  | ሸ | she | shu | shī | sha | shē | sh / shi | sho
9  | ቀ | k’e | k’u | k’ī | k’a | k’ē | k’ / k’i | k’o
10 | በ | be  | bu  | bī  | ba  | bē  | b / bi   | bo
11 | ቨ | ve  | vu  | vī  | va  | vē  | v / vi   | vo
12 | ተ | te  | tu  | tī  | ta  | tē  | t / ti   | to
13 | ቸ | che | chu | chī | cha | chē | ch / chi | cho
14 | ኀ | hā  | hu  | hī  | ha  | hē  | h / hi   | ho
15 | ነ | ne  | nu  | nī  | na  | nē  | n / ni   | no
16 | ኘ | nye | nyu | nyī | nya | nyē | ny / nyi | nyo
17 | አ | ‘ā  | ‘u  | ‘ī  | ‘a  | ‘ē  | ‘i       | ‘o
18 | ከ | ke  | ku  | kī  | ka  | kē  | k / ki   | ko
19 | ኸ | he  | hu  | hī  | ha  | hē  | h / hi   | ho
20 | ወ | we  | wu  | wī  | wa  | wē  | w / wi   | wo
21 | ዐ | ‘ā  | ‘u  | ‘ī  | ‘a  | ‘ē  | ‘i       | ‘o
22 | ዘ | ze  | zu  | zī  | za  | zē  | z / zi   | zo
23 | ዠ | zhe | zhu | zhī | zha | zhē | zh / zhi | zho
24 | የ | ye  | yu  | yī  | ya  | yē  | y / yi   | yo
25 | ደ | de  | du  | dī  | da  | dē  | d / di   | do
26 | ጀ | je  | ju  | jī  | ja  | jē  | j / ji   | jo
27 | ገ | ge  | gu  | gī  | ga  | gē  | g / gi   | go
28 | ጠ | t’e  | t’u  | t’ī  | t’a  | t’ē  | t’ / t’i   | t’o
29 | ጨ | ch’e | ch’u | ch’ī | ch’a | ch’ē | ch’ / ch’i | ch’o
30 | ጰ | p’e  | p’u  | p’ī  | p’a  | p’ē  | p’ / p’i   | p’o
31 | ጸ | ts’e | ts’u | ts’ī | ts’a | ts’ē | ts’ / ts’i | ts’o
32 | ፀ | ts’e | ts’u | ts’ī | ts’a | ts’ē | ts’ / ts’i | ts’o
33 | ፈ | fe  | fu  | fī  | fa  | fē  | f / fi   | fo
34 | ፐ | pe  | pu  | pī  | pa  | pē  | p / pi   | po

4.1.4 Post-transliteration preprocessing

At this stage, cleaning and splitting of the corpus are performed. These two preprocessing steps make the transliterated corpus ready for MT training. The cleaning task removes empty lines from the corpus, removes redundant spaces between characters and words, and discards extremely long sentences (sentences with more than 80 words). After this task is completed, the total number of sentences in the corpus drops from 225,304 to 218,365.

Finally, for training our MT models, the transliterated and preprocessed texts are divided into three parts. For the sake of comparison (to see the effect of the transliterated data on the performance of the MT models), the same split ratio as in the experiments done in [31] is used: 212,115 sentences for training, 5,000 sentences for validation, and 1,250 sentences for testing.

4.2 Augmentation of the transliterated corpus

In addition to the transliteration task, corpus augmentation is performed to increase the size of the transliterated English-Amharic corpus. Several publications have indicated that corpus augmentation can be an effective method of scaling up corpora, especially for languages with a limited resource base. Hence, in this work, token-level corpus augmentation is applied and the augmented corpus is used as the training dataset for the different NMT models. Among the alternative token-level augmentation techniques, random insertion, replacement, deletion, and swapping are selected and implemented (a sketch of these operations is given at the end of this subsection). In doing so, seven different augmented corpora are generated by varying the deletion probability, the replacement probability, and the swapping range. The cosine similarity between the original corpus and each augmented corpus is then calculated, and the augmented corpus that preserves approximately 90% of the meaning is selected [31].

The augmentation task is done separately for the training, validation, and test sets to avoid overlapping sentences between the sets. By combining these augmented data sets with the transliterated corpus, 424,230 training, 10,000 validation, and 2,500 testing sentences are created. Overall, this results in 436,730 cleaned, transliterated, and augmented sentences.
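The token-level operations named above can be sketched as follows. This is our own illustration: the probabilities, swap range, and word list are placeholders, not the values tuned in this work.

```python
# Token-level augmentation: random deletion, swapping, replacement, and
# insertion over a whitespace-tokenized sentence. The paper tunes these
# parameters and keeps the variant whose cosine similarity to the original
# corpus stays around 90%; the values below are only illustrative.
import random

def random_deletion(tokens, p_del=0.1):
    kept = [t for t in tokens if random.random() > p_del]
    return kept or tokens[:1]            # never return an empty sentence

def random_swap(tokens, max_range=3):
    tokens = tokens[:]
    if len(tokens) > 1:
        i = random.randrange(len(tokens) - 1)
        j = min(len(tokens) - 1, i + random.randint(1, max_range))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_replacement(tokens, vocab, p_rep=0.1):
    return [random.choice(vocab) if random.random() < p_rep else t
            for t in tokens]

def random_insertion(tokens, vocab, n_ins=1):
    tokens = tokens[:]
    for _ in range(n_ins):
        tokens.insert(random.randrange(len(tokens) + 1), random.choice(vocab))
    return tokens

if __name__ == "__main__":
    vocab = ["ityop'ya", "sentence", "corpus"]          # placeholder vocabulary
    sent = "the corpus was created from news text".split()
    print(" ".join(random_deletion(sent)))
    print(" ".join(random_swap(sent)))
    print(" ".join(random_replacement(sent, vocab)))
    print(" ".join(random_insertion(sent, vocab)))
```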
4.3 NMT Experiments

In these experiments, three different NMT model families are created and their performance is evaluated by comparing them to previous attempts for this language pair: an RNN model with an attention mechanism, a GRU-based model, and Transformer-based models. Each model is trained using the transliterated and augmented corpus.

4.3.1 Attention-based RNN model

The open-source toolkit OpenNMT [34, 35] is used to build this model. Given that the corpus is divided into three parts (training, validation, and testing sets) in the preprocessing stage of this experiment, the first task in training the RNN-based model is performing Byte Pair Encoding (BPE). BPE enables NMT models to translate with an open vocabulary by encoding rare and unknown words as sequences of sub-word units, based on the intuition that various word classes are translatable via units smaller than words [36] (a sketch of sub-word segmentation is given at the end of this subsection). The next step is preprocessing: it computes the vocabularies from the most frequent tokens, filters out overly long sentences, and assigns an index to each token. Finally, the RNN-based NMT model with attention mechanisms is trained with the parameters listed in Table 5.

Table 5: Parameters and values of the RNN model

Parameter           | Value
Training set        | 424,230
Validation set      | 10,000
Testing set         | 2,500
Hidden units        | 512
Layers              | 6
Word vector size    | 512
Training steps      | 20,000
Batch size          | 4,096
Label smoothing     | 0.1
Attention mechanism | Bahdanau
Evaluation metric   | BLEU

Training is the most time-consuming task in the whole process of creating this model. A larger batch size is advantageous for improving training time and quality, so a large batch size is used in this experiment. The larger the batch size, the greater the efficiency: matrix multiplication with small batch sizes is very inefficient, while a larger matrix can more effectively utilize GPU cores and RAM [37, 38].
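For illustration, the sub-word segmentation step can be reproduced with the sentencepiece library, used here as a stand-in BPE implementation (the paper itself relies on the BPE tooling of its NMT toolkits). The file name and vocabulary size below are placeholders.

```python
# Learn a BPE model on the transliterated training text and segment a sentence
# into sub-word units (sentencepiece used as a stand-in BPE implementation).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.transliterated.en-am.txt",   # placeholder path
    model_prefix="bpe_am_en",
    vocab_size=30000,                          # illustrative; cf. BPE sizes in Table 7
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_am_en.model")
print(sp.encode("dani'ēli ityop'ya wisit'i nati", out_type=str))
# Rare words are split into smaller, reusable sub-word pieces.
```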
4.3.2 GRU-based model

In comparison to conventional RNNs and LSTMs, GRUs are relatively new architectures that are used in many machine learning applications. Because they have fewer parameters, they improve on the training time of LSTMs and mitigate the vanishing and exploding gradients that occur with RNNs [39].

To conduct the GRU-based NMT experiment, three distinct units (encoder, attention, and decoder) are created. The encoder and decoder units each have three GRU layers with a hidden state size of 512. Before training begins, the tokenizer converts each word to a unique integer value, which is then converted to word embeddings by the embedding unit. The embedding layer has a dimension of 128. The overall architecture of our GRU-based NMT model and the detailed training parameters are depicted in Figure 2 and Table 6, respectively (a simplified code sketch follows Table 6).

Figure 2: GRU model architecture.

Table 6: Parameters and values of the GRU model

Parameter           | Value
Training set        | 424,230
Validation set      | 10,000
Testing set         | 2,500
Encoder units       | 512
Attention mechanism | Bahdanau
Decoder units       | 512
Embedding size      | 128
Loss function       | cross entropy
Optimizer           | RMSprop
Batch size          | 512
Evaluation metric   | BLEU
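The Keras sketch below is our own simplified reconstruction of such an architecture, not the authors' exact implementation: layer sizes follow Table 6, but attention is applied over the full decoder output rather than step by step, and the vocabulary sizes are placeholders.

```python
# Minimal Keras sketch of a GRU encoder-decoder with additive (Bahdanau-style)
# attention, mirroring the sizes in Table 6: 3 GRU layers of 512 units per
# side, 128-dim embeddings, RMSprop optimizer, cross-entropy loss.
import tensorflow as tf
from tensorflow.keras import layers, Model

SRC_VOCAB, TGT_VOCAB, EMB, UNITS = 30000, 30000, 128, 512  # illustrative sizes

enc_in = layers.Input(shape=(None,), dtype="int32", name="source_tokens")
x = layers.Embedding(SRC_VOCAB, EMB, mask_zero=True)(enc_in)
for _ in range(3):                                  # 3-layer GRU encoder
    x = layers.GRU(UNITS, return_sequences=True)(x)
enc_out = x

dec_in = layers.Input(shape=(None,), dtype="int32", name="target_tokens")
y = layers.Embedding(TGT_VOCAB, EMB, mask_zero=True)(dec_in)
for _ in range(3):                                  # 3-layer GRU decoder
    y = layers.GRU(UNITS, return_sequences=True)(y)

# Additive attention: decoder states query the encoder states.
context = layers.AdditiveAttention()([y, enc_out])
y = layers.Concatenate()([y, context])
logits = layers.Dense(TGT_VOCAB, activation="softmax")(y)

model = Model([enc_in, dec_in], logits)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
model.summary()
```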
4.3.3 Transformer-based model

The Transformer is architecturally distinct from the other NMT models because it relies entirely on attention mechanisms, which makes it well suited to capturing long-term dependencies between words in a given text. In this experiment, Transformer-based models are built using the NMT-Keras toolkit, a versatile toolkit based on the Keras library for training deep learning NMT models [40]. For comparison purposes, three different Transformer models are created: Transformer-Big, Transformer-Default, and Transformer-Best Practice, hereafter referred to as Transformer-B, Transformer-D, and Transformer-BP, respectively. These models are trained using different hyper-parameter values but the same training, validation, and testing data sets. Transformer-B and Transformer-D are trained using pre-configured hyper-parameter values, whereas Transformer-BP is trained using tuned hyper-parameter values.

To determine the hyper-parameter values for the Transformer-BP model, several papers that investigate the effect of hyper-parameter values on the translation quality of NMT models were surveyed. In particular, papers focusing on Transformer-based models for low-resource language translation were critically reviewed, and the hyper-parameter values were then chosen with the size of our corpus in mind. The parameter values of the three models are summarized in Table 7.

Table 7: Hyperparameters of the three Transformer models

Hyper-parameter        | Transformer-B | Transformer-D | Transformer-BP
Training set           | 424,230       | 424,230       | 424,230
Validation set         | 10,000        | 10,000        | 10,000
Testing set            | 2,500         | 2,500         | 2,500
Feed-forward dimension | 4096          | 2048          | 2048
BPE size               | 40k           | 37k           | 30k
Attention heads        | 16            | 8             | 4
Dropout                | 0.5           | 0.1           | 0.3
Layers                 | 7             | 6             | 5
Label smoothing        | 0.8           | 0.1           | 0.3
Enc/dec layer drop     | 0.4           | 0.0/0.0       | 0/0.1
Src/tgt word dropout   | 0.3           | 0.0/0.0       | 0.2/0.2
Activation dropout     | 0.5           | 0.0           | 0
Batch size             | 12,288        | 4,096         | 12,288

5 Experimental results

Table 8 presents the BLEU scores of all the models. The BLEU scores in the "augmented corpus" column are cited from previous works for the purpose of comparison and analysis.

Table 8: Experimental results of the different NMT models (BLEU)

Model type     | Augmented corpus (previous works) | Augmented + transliterated corpus (present work)
RNN.att        | 35.38                             | 35.76
GRU            | 37.79                             | 38.22
Transformer-B  | 35.62                             | 35.91
Transformer-D  | 36.53                             | 36.85
Transformer-BP | 39.21                             | 39.67

As the table shows, the BLEU scores of all the models improve when the transliterated and augmented corpus is used. The Transformer-BP and GRU models benefit slightly more from the transliterated corpus than the other models. This is because Transformer-BP is trained with hyperparameters that have been adjusted to account for the size of the corpus, while the GRU model inherently uses a small number of parameters, making it easier to select appropriate hyper-parameter values and achieve better BLEU scores. On the other hand, the hyper-parameter values of the other Transformer-based models (Transformer-B and Transformer-D) are set for larger corpus sizes, so their performance is lower than that of all the remaining models; they benefit the least from the transliterated corpus.

A two-tailed paired t-test is used to determine whether the BLEU scores obtained by the models trained with the transliterated corpus differ significantly from the previous scores. The resulting p-value (0.000301279) is smaller than the significance level (0.05), indicating a significant difference between the two sets of BLEU scores (a sketch of this test is given below). From this, we conclude that transliterating the corpus improves the performance of all three NMT model families. In particular, the BLEU score of the Transformer-BP model is the highest score reported so far for English-Amharic MT.
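The significance check can be reproduced from the numbers in Table 8 alone. The minimal sketch below is our own, using the five BLEU pairs reported above with scipy's paired t-test.

```python
# Two-tailed paired t-test over the five BLEU pairs from Table 8.
from scipy import stats

previous = [35.38, 37.79, 35.62, 36.53, 39.21]   # augmented corpus only
present  = [35.76, 38.22, 35.91, 36.85, 39.67]   # augmented + transliterated

t_stat, p_value = stats.ttest_rel(present, previous)
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")     # p is about 0.0003 < 0.05
```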
6 Conclusion

Low-resource MT is still a work in progress and in its early stages for a variety of reasons. In contrast, MT research for resource-rich languages has come a long way in the acquisition of resources and the creation of different MT architectures. As a result, several successful NMT architectures have been introduced, including RNNs, GRUs, and, most importantly, the Transformer. However, due to resource constraints (particularly the lack of large bilingual corpora), most languages in the low-resource category do not benefit from these successful architectures. Amharic is one of these languages. In this work, we therefore decided to take up this challenge and attempted to improve the performance of English-Amharic MT using corpus transliteration and augmentation.

To that end, we created the largest Amharic-English transliteration corpus to date from the previously collected English-Amharic parallel corpus, using Google Translate (for transliteration) and human data collectors (for normalization). In the normalization process, transliterated names and borrowed words are spelled as closely to their English translations as possible. After this, a token-level corpus augmentation technique was applied to the transliterated corpus in order to artificially increase its size. By doing so, we were able to create a corpus (transliterated and augmented) with a size of 450,608 parallel sentences.

With the created data set, RNN-with-attention, GRU-based, and Transformer-based NMT architectures were trained. Compared to our previous work, in which we used corpus augmentation with similar training parameters, all three model families in this study achieve better MT performance. In particular, the BLEU score achieved by one of the three Transformer models (Transformer-BP) is, to the best of our knowledge, the state-of-the-art result (39.67 BLEU) for this language pair so far, and transliteration played a part in this.

Overall, this work adds two contributions to the knowledge base of English-Amharic MT research: the creation of an English-Amharic transliteration and augmentation corpus, and the improvement of English-Amharic MT performance.

References

[1] Kirkpatrick, Andy. English as a lingua franca in ASEAN: A multilingual model. Vol. 1. Hong Kong University Press, 2010. https://doi.org/10.5790/hongkong/9789888028795.003.0008

[2] Kramsch, Claire. "Teaching foreign languages in an era of globalization: Introduction." The Modern Language Journal 98.1 (2014): 296-311. https://doi.org/10.1111/j.1540-4781.2014.12057.x

[3] Coulmas, Florian. Sociolinguistics: The study of speakers' choices. Cambridge University Press, 2013.

[4] Kumaran, Adimugan, and Tobias Kellner. "A generic framework for machine transliteration." Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2007. https://doi.org/10.1145/1277741.1277876
[5] Zhou, Dong, et al. "Translation techniques in cross-language information retrieval." ACM Computing Surveys (CSUR) 45.1 (2012): 1-44.

[6] Alkhatib, Manar, and Khaled Shaalan. "The key challenges for Arabic machine translation." Intelligent Natural Language Processing: Trends and Applications. Springer, Cham, 2018. 139-156. https://doi.org/10.1007/978-3-319-67056-0_8

[7] Thanh, Thao Phan Thi. Machine translation of proper names from English and French into Vietnamese: an error analysis and some proposed solutions. Diss. Université de Franche-Comté, 2014.

[8] Guzmán, Francisco, et al. "The FLORES evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English." arXiv preprint arXiv:1902.01382 (2019).

[9] Aqlan, Fares, et al. "Improved Arabic-Chinese machine translation with linguistic input features." Future Internet 11.1 (2019): 22. https://doi.org/10.3390/fi11010022

[10] Harrat, Salima, Karima Meftouh, and Kamel Smaili. "Machine translation for Arabic dialects (survey)." Information Processing & Management 56.2 (2019): 262-273. https://doi.org/10.1016/j.ipm.2017.08.003

[11] Finch, Andrew, et al. "A Bayesian model of transliteration and its human evaluation when integrated into a machine translation system." IEICE Transactions on Information and Systems 94.10 (2011): 1889-1900. https://doi.org/10.1587/transinf.e94.d.1889

[12] Salawu, Abiodun, and Asemahagn Aseres. "Language policy, ideologies, power and the Ethiopian media." Communicatio 41.1 (2015): 71-89. https://doi.org/10.1080/02500167.2015.1018288

[13] Menuta, Fekede. "Over-differentiation in Amharic orthography and attitude towards reform." The Ethiopian Journal of Social Sciences and Language Studies (EJSSLS) 3.1 (2016): 3-32.

[14] Kaur, Kamaljeet, and Parminder Singh. "Review of machine transliteration techniques." International Journal of Computer Applications 107.20 (2014).

[15] Karimi, Sarvnaz, Falk Scholer, and Andrew Turpin. "Machine transliteration survey." ACM Computing Surveys (CSUR) 43.3 (2011): 1-46. https://doi.org/10.1145/1922649.1922654

[16] Le, Ngoc Tan, and Fatiha Sadat. "Low-resource machine transliteration using recurrent neural networks of Asian languages." Proceedings of the Seventh Named Entities Workshop. 2018. https://doi.org/10.18653/v1/w18-2414

[17] Deep, Kamal, and Vishal Goyal. "Development of a Punjabi to English transliteration system." International Journal of Computer Science and Communication 2.2 (2011): 521-526.

[18] Goyal, Vishal, and Gurpreet Singh Lehal. "Hindi-Punjabi Machine Transliteration System (For Machine Translation System)." George Ronchi Foundation Journal, Italy 64.1 (2009). https://doi.org/10.4304/jetwi.2.2.148-151

[19] Duan, Xiangyu, et al. "Report of NEWS 2016 machine transliteration shared task." Proceedings of the Sixth Named Entity Workshop. 2016. https://doi.org/10.18653/v1/w16-2709

[20] Finch, Andrew, and Eiichiro Sumita. "Transliteration using a phrase-based statistical machine translation system to re-score the output of a joint multigram model." Proceedings of the 2010 Named Entities Workshop. 2010. https://doi.org/10.3115/1699705.1699719

[21] Ngo, Hoang Gia, et al. "Phonology-augmented statistical transliteration for low-resource languages." Sixteenth Annual Conference of the International Speech Communication Association. 2015.

[22] Laurent, Antoine, Paul Deléglise, and Sylvain Meignier.
"Grapheme to phoneme conversion using an SMT system." 10th Annual Conference of the International Speech Communication Association (INTERSPEECH 2009). 2009. https://doi.org/10.21437/interspeech.2009-243

[23] Finch, Andrew, et al. "Target-bidirectional neural models for machine transliteration." Proceedings of the Sixth Named Entity Workshop. 2016. https://doi.org/10.18653/v1/w16-2711

[24] Shao, Yan, and Joakim Nivre. "Applying neural networks to English-Chinese named entity transliteration." Proceedings of the Sixth Named Entity Workshop. 2016. https://doi.org/10.18653/v1/w16-2710

[25] Thu, Ye Kyaw, et al. "Comparison of grapheme-to-phoneme conversion methods on a Myanmar pronunciation dictionary." Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP 2016). 2016.

[26] Yao, Kaisheng, and Geoffrey Zweig. "Sequence-to-sequence neural net models for grapheme-to-phoneme conversion." arXiv preprint arXiv:1506.00196 (2015). https://doi.org/10.21437/interspeech.2015-134

[27] Rao, Kanishka, et al. "Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks." 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015. https://doi.org/10.1109/icassp.2015.7178767

[28] Tedla, Tadele. "amLite: Amharic Transliteration Using Key Map Dictionary." arXiv preprint arXiv:1509.04811 (2015).

[29] Gezmu, Andargachew Mekonnen, Andreas Nürnberger, and Tesfaye Bayu Bati. "Neural Machine Translation for Amharic-English Translation." ICAART (1). 2021. https://doi.org/10.5220/0010383905260532

[30] Biadgligne, Yohannes, and Kamel Smaïli. "Parallel Corpora Preparation for English-Amharic Machine Translation." International Work-Conference on Artificial Neural Networks. Springer, Cham, 2021. https://doi.org/10.1007/978-3-030-85030-2_37

[31] Biadgligne, Yohannes, and Kamel Smaïli. "Offline Corpus Augmentation for English-Amharic Machine Translation." ICICT CPS conference proceedings. 2022. https://doi.org/10.1109/icict55905.2022.00030

[32] Yuliani, S. Y., et al. "Hoax news validation using similarity algorithms." Journal of Physics: Conference Series. Vol. 1524. No. 1. IOP Publishing, 2020.

[33] United States Board on Geographic Names, and United States Defense Mapping Agency. Gazetteer of Ethiopia: Names Approved by the United States Board on Geographic Names. Defense Mapping Agency, 1982. https://doi.org/10.5962/bhl.title.39085

[34] Klein, Guillaume, et al. "OpenNMT: Open-source toolkit for neural machine translation." arXiv preprint arXiv:1701.02810 (2017).

[35] Andrabi, Syed Abdul Basit, and Abdul Wahid. "A Comprehensive Study of Machine Translation Tools and Evaluation Metrics." Inventive Systems and Control. Springer, Singapore, 2021. 851-865. https://doi.org/10.1007/978-981-16-1395-1_62

[36] Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Neural machine translation of rare words with subword units." arXiv preprint arXiv:1508.07909 (2015). https://doi.org/10.18653/v1/p16-1162

[37] McCandlish, Sam, et al. "An empirical model of large-batch training." arXiv preprint arXiv:1812.06162 (2018).

[38] Yao, Zhewei, et al. "Large batch size training of neural networks with adversarial training and second-order information." arXiv preprint arXiv:1810.01021 (2018).

[39] Alom, Md Zahangir, et al.
"A state-of-the-art survey on deep learning theory and architectures." Electronics 8.3 (2019): 292.

[40] Peris, Álvaro, and Francisco Casacuberta. "NMT-Keras: a very flexible toolkit with a focus on interactive NMT and online learning." arXiv preprint arXiv:1807.03096 (2018). https://doi.org/10.2478/pralin-2018-0010