https://doi.org/10.31449/inf.v47i6.4395    Informatica 47 (2023) 191–202

Baseline Transliteration Corpus for Improved English-Amharic Machine Translation

Yohannes Biadgligne 1, Kamel Smaili 2
1 Sudan University of Science and Technology (SUST) and Bahir Dar Institute of Technology (BIT), Khartoum and Bahir Dar
2 Loria - Université Lorraine, France
E-mail: yohannesb2001@gmail.com, kamel.smaili@loria.fr

Keywords: machine translation, Amharic language, transliteration, augmentation, NMT, RNN, GRU, transformer

Received: September 19, 2022

Machine translation (MT) between English and Amharic is one of the least studied and, performance-wise, least successful topics in the MT field. We therefore propose to apply corpus transliteration and augmentation techniques in this study to address this issue and improve MT performance for this language pair. This paper presents the creation, the augmentation, and the use of an Amharic-to-English transliteration corpus for NMT experiments. The created corpus contains a total of 450,608 parallel sentences before preprocessing and is used, after preprocessing, to train three different NMT architectures. These models are built using Recurrent Neural Networks with an attention mechanism (RNN), Gated Recurrent Units (GRUs), and Transformers. For the Transformer-based experiments, three different Transformer models with different hyperparameters are created. Compared to previous works, the BLEU scores of all NMT models used in this study are improved. One of the three Transformer models, in particular, achieves the highest BLEU score ever recorded for this language pair.

Povzetek: This research addresses the improvement of machine translation (MT) between English and Amharic, one of the least studied and least successful areas in MT. The use of corpus transliteration and augmentation techniques is proposed. A corpus for NMT experiments comprising 450,608 parallel sentences was created.

1 Introduction

In today's age of technology and social media, it is increasingly common to incorporate foreign words into one's native tongue and to compose in one language using the script of another. English is the most widely used language in this regard [1]. This can be attributed to many reasons, one of which is the prevalence of the 'QWERTY' keyboard layout in laptops, smartphones, and even mechanical typewriters, especially in developing countries. Thus, many people who do not speak English prefer to compose their ideas using English script across multiple messaging platforms. This writing method is known as transliteration [2, 3].

In the 1990s, NLP researchers became interested in creating machines for transliteration purposes to support other research areas. This was the first time the concept of machine transliteration was introduced. Machine transliteration is a subfield of MT and cross-language information retrieval (CLIR). Its primary goal is to use computers to convert a text from one language script (the source language) to another language script (the target language) while preserving the pronunciation as much as possible. In technical terms, it is concerned with accurately representing the graphemes of one language script using the script of another language [4].
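To make the grapheme-level view concrete, the sketch below shows a minimal grapheme-based mapping in Python. The character map and example word are illustrative only; they are not the mapping used later in this paper.

```python
# Minimal, illustrative grapheme-based transliteration: each source-script
# grapheme is rewritten with a target-script (Latin) string. The tiny map
# below is a hypothetical excerpt, not the full table used in this work.
CHAR_MAP = {
    "ኢ": "ī", "ት": "t", "ዮ": "yo", "ጵ": "p'", "ያ": "ya",  # spells "Ethiopia"
}

def transliterate(word: str, char_map: dict[str, str]) -> str:
    """Replace every known grapheme; keep unknown characters unchanged."""
    return "".join(char_map.get(ch, ch) for ch in word)

if __name__ == "__main__":
    print(transliterate("ኢትዮጵያ", CHAR_MAP))  # -> ītyop'ya
```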
The literature on MT suggests that transliteration can be used with MT systems to reduce translation errors and improve precision when translating names (named entities), technical terms, and loan (borrowed) words [5, 6, 7], particularly for languages with limited resources (e.g., small bilingual corpora), such as Amharic, because learning all of the words of a given language from a small amount of bilingual training data is impossible [8, 9, 10]. Finch et al. [11] carried out a large-scale, real-world evaluation of the use of automatic transliteration in an MT system and demonstrated that using a transliteration system can improve MT quality when translating unknown words. As a result, machine transliteration has become a promising application for MT. Table 1 shows the distinction between translation and transliteration for the languages under consideration (Amharic and English).

Table 1: Example of Amharic to English translation and transliteration

Amharic   | Translation | Transliteration
ኢትዮጵያ | Ethiopia    | ītiyop’iya
አፍሪካ   | Africa      | āfirīka
ውስጥ ናት | is in       | wisit’i nati

Amharic (አማርኛ, /əmərɨgnə/), the main language of Ethiopia, has its own script and is the second most widely spoken Semitic language after Arabic. The Amharic script was originally derived from Ge'ez (ግዕዝ, /gə'əzzə/). Although it has disappeared as a colloquial language, Ge'ez remains the main language used for prayer and ritual performance, and the main teaching language, in the Ethiopian Orthodox Church [12]. Amharic uses a slightly modified version of the Ge'ez alphabet. It consists of 34 basic characters, each of which has seven forms depending on which vowel is pronounced in the syllable. Even though they are no longer widely used, Amharic also inherits all of the Ge'ez numeric character sets [13].

2 Related work

Machine transliteration is rarely an end goal by itself; it is usually used as part of other NLP tasks (such as CLIR, question answering, or MT). In light of its importance in these fields, a number of transliteration mechanisms have been proposed for non-English languages, including Russian, Chinese, Korean, Arabic, Persian, and Indian languages [14]. These mechanisms generally fall into three broad categories: linguistic (rule-based) approaches, statistical approaches, and deep learning approaches [15].

The linguistic approach uses hand-crafted rules based on pattern matching, which requires linguistic analysis to formulate the rules. This approach demands a thorough understanding of the language under consideration. Early attempts used this method to construct baseline transliteration corpora, and it is still used as a starting point to acquire transliteration corpora for low-resource languages [16]. Deep and Goyal [17] proposed a Punjabi-to-English transliteration system that uses a linguistic (rule-based) approach. In the proposed transliteration scheme, a grapheme-based method is used to model the transliteration problem, and it achieves an accuracy of 93.22% when transliterating common names. A similar transliteration system was developed by Goyal and Lehal [18] by implementing fifty complex rules. Their system was found to give about 98% accuracy for transliterating proper names, city names, country names, subject-related technical terms, etc.

Various transliteration systems were proposed during the Named Entities Workshop (NEWS) evaluation campaigns between 2009 and 2018 [19].
During these campaigns, transliteration was performed from English into various languages with various writing systems. As a result of this workshop series, many advances have been made in methodologies for transliterating proper nouns. Several approaches have been developed, including grapheme-to-phoneme conversion [20, 21], statistical approaches analogous to statistical machine translation [16, 22], and neural networks such as sequence-to-sequence models and Long Short-Term Memory (LSTM) networks [23, 24, 25, 26, 27].

The three transliteration approaches discussed previously can be based on grapheme, phoneme, hybrid, or correspondence transliteration models. (A grapheme is a letter or set of letters that represents the sound, i.e. phoneme, of a word; a phoneme is one of the smallest speech units that distinguishes one word from another.)

- Grapheme-based models: directly convert source language graphemes into target language graphemes without requiring phonetic knowledge of the source language words.
- Phoneme-based models: use source language phonemes as a pivot when producing target language graphemes from source language graphemes.
- Hybrid and correspondence-based models: use both source language graphemes and phonemes.

Generally, statistical and neural network techniques based on large parallel transliteration corpora work well for resource-rich languages, but low-resource languages do not have the luxury of such resources. For such languages, rule-based transliteration is the only viable option [16].

2.1 Amharic transliteration

In our literature review, we found two cases where Amharic was studied for transliteration tasks. The first attempt was made by Tadele Tedla [28]. His objective was to develop a framework to convert ASCII-transliterated Amharic text back to the original Amharic text. In the transliteration of three random test data sets, the model achieves 97.7, 99.7, and 98.4 percent accuracy, respectively. The first test set consists of an ASCII-transliterated Amharic word list of 32,482 words. The second test set is a transliterated poem with 1,277 words, and the third is a recipe for injera, a common local food in Ethiopia, with 123 transliterated Amharic words.

Gezmu et al. [29] made the second attempt at Amharic-to-English machine transliteration. In their work, they used machine transliteration as a tool (to facilitate vocabulary sharing) to improve the performance of Amharic-English MT. Despite claiming to have created an Amharic-English transliteration corpus for named entities and borrowed words, they did not make it publicly available.

Based on this review of the literature, we believe that our attempt is the first to create a large Amharic-English transliteration corpus for English-Amharic NMT. Table 2 summarizes the related works discussed in this section.

Table 2: Summary of related works

Author, Year | Language pair | Model / approach | Objective | Results
Goyal, Vishal, and Gurpreet Singh Lehal (2009) | Hindi to Punjabi | G2P, rule-based | For MT | 98% accuracy
Deep, Kamal, and Vishal Goyal (2011) | Punjabi to English | G2P, rule-based | For MT and CLIR | 93.22% accuracy
Laurent, Antoine, Paul Deléglise, and Sylvain Meignier (2009) | French to French | G2P, SMT | For ASR system | Comparable to the dictionary look-up strategy
Finch, Andrew, and Eiichiro Sumita (2010) | English to Thai, Hindi, Tamil, Kannada, Japanese, Bangla | G2P, PBSMT and joint multigram model | For MT and CLIR | Better performance than previous works
Yao, Kaisheng, and Geoffrey Zweig (2015) | US English | G2P, Bi-LSTM | For MT and image captioning | Outperforms previous SOTA models
Rao, Kanishka, et al. (2015) | US English | G2P, LSTM-RNN | Not explicitly mentioned | Improvement over previous similar works
Shao, Yan, and Joakim Nivre (2016) | English to Chinese and Chinese to English | Neural networks (CNN and RNN) | Not explicitly mentioned | Competitive results with SOTA models at the time
Thu, Ye Kyaw, et al. (2016) | Myanmar to English | G2P, PBSMT, CRF, S-Arrow, JSM | For pronunciation dictionary | CRF and PBSMT achieved the best results
Tedla, Tadele (2015) | ASCII-transliterated Amharic to original Amharic | G2P, key-map dictionary | Not explicitly mentioned | 97.7% to 99.7% accuracy
Gezmu, Andargachew Mekonnen, Andreas Nürnberger, and Tesfaye Bayu Bati (2021) | Amharic to English | G2P, rule-based | For MT | Better results than previous works

3 Motivation

Developing a reliable English-Amharic MT system remains a challenge. A scarcity of resources and the absence of well-organized MT research projects are the two major obstacles to overcoming this challenge. Our search reveals that the majority, if not all, of the research works on English-Amharic MT are done by independent individuals and are disjointed. The BLEU scores reported for this language pair are, therefore, not indicative of high-quality translation, according to a general interpretation of the BLEU score.
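For reference, the BLEU scores reported in this paper are corpus-level scores. The snippet below shows one common way to obtain such a score with the sacrebleu package; the example sentences are made up and are not taken from the corpus used here.

```python
# Corpus-level BLEU with sacrebleu (illustrative sentences only).
import sacrebleu

hypotheses = ["the corpus was created from news and legal text"]
references = [["the corpus was built from news and legal texts"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```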
Thus, this study aims to enhance English-Amharic MT performance by incorporating transliteration as a tool. To achieve this goal, we created an Amharic-English transliteration corpus from a previously collected English-Amharic MT corpus [30, 31] and used it for English-Amharic NMT experiments. This is the first baseline corpus for this language pair, and it will be made available to MT and IR researchers.

4 Experimental set-up

4.1 Corpus preparation

The objective of this study is to improve the performance of English-Amharic MT by using a transliterated and augmented corpus. However, the data required for training the NMT models is not available. As a result, the previously gathered English-Amharic translation corpus is used to generate an Amharic-English transliteration corpus. This section explains the methods and techniques used to create this corpus, as well as the NMT experiments performed with it.

4.1.1 Acquisition of the previously collected translation corpus

The freely available English-Amharic translation corpus was obtained from the GitHub repositories https://github.com/yohannesb/English-Amharic-origional-corpus and https://github.com/yohannesb/English-Amharic-Augmented-corpus [30]. This corpus was compiled from the religious, legal, and news domains and contains 225,304 English-Amharic parallel sentences.

4.1.2 Pre-transliteration preprocessing

This step is completed before the transliteration process begins. It is performed on the previously acquired original Amharic translation corpus. Normalization of homophone characters, removal of punctuation marks, and conversion of Amharic numerals to Arabic numerals are all carried out. After these preprocessing tasks are completed, the corpus is divided into 25 parts and distributed to data collectors. These data collectors use Google Translate to transliterate the Amharic sentences into English script and then collect the transliterated sentences by copying and pasting them into a text file.

4.1.3 Transliterating the acquired corpus

Two different steps are followed to complete this task successfully.

1. Performing transliteration: This process was carried out using Google's online translation tool (https://translate.google.com/). Although its primary goal is translation, Google Translate can generate text transliterations as part of the translation process when the two languages use distinct scripts. The main task completed at this stage, as shown in Figure 1, was transliterating Amharic sentences to English using Google Translate and collecting the transliterated sentences.

Figure 1: Snapshot taken from Google Translate.

In order to transliterate and compile a total of 225,304 Amharic sentences, 25 data collectors (computer science students) participated. The entire process of transliterating and normalizing these Amharic sentences took 60 days, with each data collector having a daily throughput of about 150 sentences. Prior to the transliteration task, each data collector was given brief training and guidance to improve the quality and consistency of the transliteration process.

2. Normalizing the transliteration corpus: After the transliteration corpus was collected, the next task was corpus normalization. The objective of this task was to make the transliterations of Amharic loan words and named entities (NEs) as close as possible to the spelling of the corresponding English words, so that they become useful for MT purposes.
To assist this manual normalization process, truecasing is first carried out using Moses' built-in truecaser script. Because Amharic has a Subject-Object-Verb (SOV) grammatical structure and NEs are therefore likely to appear at the beginning of a sentence, truecasing allowed us to capitalize the first letter of the majority of NEs. This reduces the amount of work required to locate and correct NEs that are transliterated differently from their English version.

Table 3 contains examples of transliterations produced by Google Translate and their normalized forms. The table also includes the Levenshtein edit distance [32] computed between the English translation and the generated and normalized transliterations. Computing the Levenshtein edit distance allows us to choose the transliterations closest to the English translation (a short code sketch of this computation follows at the end of this step).

Table 3: Examples of Amharic to English transliteration using Google Translate and the normalized form of the transliteration. The last three columns give the Levenshtein edit distance between the English translation and, respectively, the Google transliteration, the normalized form, and the Amharic text.

Amharic text | English translation | Google transliteration | Normalized form | Dist. (Google) | Dist. (normalized) | Dist. (Amharic)
ዳንኤል   | daniel    | dani’ēli     | danēl      | 3 | 2 | 6
ሞሐመድ  | mohamed   | moḥāmedi     | mohāmed    | 2 | 0 | 7
አይሻ    | ayisha    | āyisha       | āysha      | 1 | 1 | 6
ማርታ    | marta     | marita       | marta      | 1 | 0 | 5
ቤተልሔም | bethlehem | betelihemi   | betelhem   | 3 | 2 | 9
ኢትዮጵያ | ethiopia  | ītiyop’iya   | ityop’ya   | 4 | 4 | 8
ኮምፒዩተር | computer  | komipīyuteri | kompīyuter | 5 | 3 | 8

As the table shows, most of the differences between the English translation and the transliterations generated by Google Translate occur in the representation of the sixth form of Amharic characters. For instance, the name Daniel (ዳንኤል) is spelled Dani’ēli by Google Translate, whereas its correct English spelling (the English translation) is Daniel. The discrepancy arises in writing the sixth form of Amharic characters: in this example, Google Translate uses ni and li to represent ን and ል, respectively. So, to bring the transliterated loan words and named entities in the corpus closer to the English words, these characters are normalized to n (for ን) and l (for ል). This normalization is done for all sixth-form characters of Amharic.
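Only a plain edit-distance routine is needed for the selection step described above. The following self-contained sketch is our own (not the authors' script) and reproduces the first row of Table 3.

```python
# Plain Levenshtein edit distance (dynamic programming), used here to pick,
# for each named entity, the transliteration closest to the English spelling.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

english = "daniel"
candidates = {"google": "dani'ēli", "normalized": "danēl"}
for name, cand in candidates.items():
    print(name, levenshtein(english, cand))   # google: 3, normalized: 2
```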
The transliteration character map used in this work is shown in Table 4, which is a modified version of the United Nations romanization system for geographical names (BGN/PCGN 1967 System) approved for Amharic-to-English transliteration [33]. In this standard, the sixth form of the Amharic characters actually has two optional representations. Overall, according to the Levenshtein edit distance, the normalized form of the Google transliteration is closer to the English translation.

Table 4: Amharic to English transliteration character map (romanized values for the seven forms of each of the 34 base characters; the sixth form has two optional representations, with and without a final i).

No. | Base | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th
1  | ሀ | hā  | hu  | hī  | ha  | hē  | h / hi   | ho
2  | ለ | le  | lu  | lī  | la  | lē  | l / li   | lo
3  | ሐ | hā  | hu  | hī  | ha  | hē  | h / hi   | ho
4  | መ | me  | mu  | mī  | ma  | mē  | m / mi   | mo
5  | ሠ | še  | šu  | šī  | ša  | šē  | š / ši   | šo
6  | ረ | re  | ru  | rī  | ra  | rē  | r / ri   | ro
7  | ሰ | se  | su  | sī  | sa  | sē  | s / si   | so
8  | ሸ | she | shu | shī | sha | shē | sh / shi | sho
9  | ቀ | k’e | k’u | k’ī | k’a | k’ē | k’ / k’i | k’o
10 | በ | be  | bu  | bī  | ba  | bē  | b / bi   | bo
11 | ቨ | ve  | vu  | vī  | va  | vē  | v / vi   | vo
12 | ተ | te  | tu  | tī  | ta  | tē  | t / ti   | to
13 | ቸ | che | chu | chī | cha | chē | ch / chi | cho
14 | ኀ | hā  | hu  | hī  | ha  | hē  | h / hi   | ho
15 | ነ | ne  | nu  | nī  | na  | nē  | n / ni   | no
16 | ኘ | nye | nyu | nyī | nya | nyē | ny / nyi | nyo
17 | አ | ‘ā  | ‘u  | ‘ī  | ‘a  | ‘ē  | ‘i       | ‘o
18 | ከ | ke  | ku  | kī  | ka  | kē  | k / ki   | ko
19 | ኸ | he  | hu  | hī  | ha  | hē  | h / hi   | ho
20 | ወ | we  | wu  | wī  | wa  | wē  | w / wi   | wo
21 | ዐ | ‘ā  | ‘u  | ‘ī  | ‘a  | ‘ē  | ‘i       | ‘o
22 | ዘ | ze  | zu  | zī  | za  | zē  | z / zi   | zo
23 | ዠ | zhe | zhu | zhī | zha | zhē | zh / zhi | zho
24 | የ | ye  | yu  | yī  | ya  | yē  | y / yi   | yo
25 | ደ | de  | du  | dī  | da  | dē  | d / di   | do
26 | ጀ | je  | ju  | jī  | ja  | jē  | j / ji   | jo
27 | ገ | ge  | gu  | gī  | ga  | gē  | g / gi   | go
28 | ጠ | t’e  | t’u  | t’ī  | t’a  | t’ē  | t’ / t’i   | t’o
29 | ጨ | ch’e | ch’u | ch’ī | ch’a | ch’ē | ch’ / ch’i | ch’o
30 | ጰ | p’e  | p’u  | p’ī  | p’a  | p’ē  | p’ / p’i   | p’o
31 | ጸ | ts’e | ts’u | ts’ī | ts’a | ts’ē | ts’ / ts’i | ts’o
32 | ፀ | ts’e | ts’u | ts’ī | ts’a | ts’ē | ts’ / ts’i | ts’o
33 | ፈ | fe  | fu  | fī  | fa  | fē  | f / fi   | fo
34 | ፐ | pe  | pu  | pī  | pa  | pē  | p / pi   | po

4.1.4 Post-transliteration preprocessing

At this stage, cleaning and splitting of the corpus are performed. These two preprocessing steps make the transliterated corpus ready for MT training. The cleaning task removes empty lines from the corpus, removes redundant spaces between characters and words, and discards extremely long sentences (sentences with more than 80 words). After this task is completed, the total number of sentences in the corpus drops from 225,304 to 218,365.

Finally, for training our MT models, the transliterated and preprocessed texts are divided into three parts. For the sake of comparison (to see the effect of the transliterated data on the performance of the MT models), the same split ratio as in the experiments done in [31] is used: 212,115 sentences for training, 5,000 sentences for validation, and 1,250 sentences for testing.

4.2 Augmentation of the transliterated corpus

In addition to the transliteration task, corpus augmentation is performed to increase the size of the transliterated English-Amharic corpus. Several publications have indicated that corpus augmentation can be an effective method of scaling up corpora, especially for languages with a limited resource base. Hence, in this work, token-level corpus augmentation is applied and the augmented corpus is used as the training dataset for the different NMT models. Among the alternative token-level augmentation techniques, random insertion, replacement, deletion, and swapping are selected and implemented (a sketch of these operations is given at the end of this subsection). In doing so, seven different augmented corpora are generated by varying the deletion probability, the replacement probability, and the swapping range. The cosine similarity between the original corpus and each augmented corpus is then calculated, and the augmented corpus that preserves approximately 90% of the meaning is selected [31].

The augmentation task is done separately for the training, validation, and test sets to avoid overlapping sentences between the sets. By combining these augmented data sets with the transliterated corpus, 424,230 training, 10,000 validation, and 2,500 testing sentences are created. Overall, this results in 436,730 cleaned, transliterated, and augmented sentences.
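The token-level operations named above can be sketched as follows. This is our own illustration: the probabilities, swap range, and word list are placeholders, not the values tuned in this work.

```python
# Token-level augmentation: random deletion, swapping, replacement, and
# insertion over a whitespace-tokenized sentence. The paper tunes these
# parameters and keeps the variant whose cosine similarity to the original
# corpus stays around 90%; the values below are only illustrative.
import random

def random_deletion(tokens, p_del=0.1):
    kept = [t for t in tokens if random.random() > p_del]
    return kept or tokens[:1]            # never return an empty sentence

def random_swap(tokens, max_range=3):
    tokens = tokens[:]
    if len(tokens) > 1:
        i = random.randrange(len(tokens) - 1)
        j = min(len(tokens) - 1, i + random.randint(1, max_range))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_replacement(tokens, vocab, p_rep=0.1):
    return [random.choice(vocab) if random.random() < p_rep else t
            for t in tokens]

def random_insertion(tokens, vocab, n_ins=1):
    tokens = tokens[:]
    for _ in range(n_ins):
        tokens.insert(random.randrange(len(tokens) + 1), random.choice(vocab))
    return tokens

if __name__ == "__main__":
    vocab = ["ityop'ya", "sentence", "corpus"]          # placeholder vocabulary
    sent = "the corpus was created from news text".split()
    print(" ".join(random_deletion(sent)))
    print(" ".join(random_swap(sent)))
    print(" ".join(random_replacement(sent, vocab)))
    print(" ".join(random_insertion(sent, vocab)))
```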
4.3 NMT Experiments

In these experiments, three different NMT model families are created and their performance is evaluated by comparing them to previous attempts for this language pair: an RNN model with an attention mechanism, a GRU-based model, and Transformer-based models. Each model is trained using the transliterated and augmented corpus.

4.3.1 Attention-based RNN model

The open-source toolkit OpenNMT [34, 35] is used to build this model. Given that the corpus is divided into three parts (training, validation, and testing sets) in the preprocessing stage of this experiment, the first task in training the RNN-based model is performing Byte Pair Encoding (BPE). BPE enables NMT models to translate with an open vocabulary by encoding rare and unknown words as sequences of sub-word units, based on the intuition that various word classes are translatable via units smaller than words [36] (a sketch of sub-word segmentation is given at the end of this subsection). The next step is preprocessing: it computes the vocabularies from the most frequent tokens, filters out overly long sentences, and assigns an index to each token. Finally, the RNN-based NMT model with attention mechanisms is trained with the parameters listed in Table 5.

Table 5: Parameters and values of the RNN model

Parameter           | Value
Training set        | 424,230
Validation set      | 10,000
Testing set         | 2,500
Hidden units        | 512
Layers              | 6
Word vector size    | 512
Training steps      | 20,000
Batch size          | 4,096
Label smoothing     | 0.1
Attention mechanism | Bahdanau
Evaluation metric   | BLEU

Training is the most time-consuming task in the whole process of creating this model. A larger batch size is advantageous for improving training time and quality, so a large batch size is used in this experiment. The larger the batch size, the greater the efficiency: matrix multiplication with small batch sizes is very inefficient, while a larger matrix can more effectively utilize GPU cores and RAM [37, 38].
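For illustration, the sub-word segmentation step can be reproduced with the sentencepiece library, used here as a stand-in BPE implementation (the paper itself relies on the BPE tooling of its NMT toolkits). The file name and vocabulary size below are placeholders.

```python
# Learn a BPE model on the transliterated training text and segment a sentence
# into sub-word units (sentencepiece used as a stand-in BPE implementation).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.transliterated.en-am.txt",   # placeholder path
    model_prefix="bpe_am_en",
    vocab_size=30000,                          # illustrative; cf. BPE sizes in Table 7
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_am_en.model")
print(sp.encode("dani'ēli ityop'ya wisit'i nati", out_type=str))
# Rare words are split into smaller, reusable sub-word pieces.
```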
4.3.2 GRU-based model

In comparison to conventional RNNs and LSTMs, GRUs are relatively new architectures that are used in many machine learning applications. Because they have fewer parameters, they improve on the training time of LSTMs and mitigate the vanishing and exploding gradients that occur with RNNs [39].

To conduct the GRU-based NMT experiment, three distinct units (encoder, attention, and decoder) are created. The encoder and decoder units each have three GRU layers with a hidden state size of 512. Before training begins, the tokenizer converts each word to a unique integer value, which is then converted to word embeddings by the embedding unit. The embedding layer has a dimension of 128. The overall architecture of our GRU-based NMT model and the detailed training parameters are depicted in Figure 2 and Table 6, respectively (a simplified code sketch follows Table 6).

Figure 2: GRU model architecture.

Table 6: Parameters and values of the GRU model

Parameter           | Value
Training set        | 424,230
Validation set      | 10,000
Testing set         | 2,500
Encoder units       | 512
Attention mechanism | Bahdanau
Decoder units       | 512
Embedding size      | 128
Loss function       | cross entropy
Optimizer           | RMSprop
Batch size          | 512
Evaluation metric   | BLEU
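The Keras sketch below is our own simplified reconstruction of such an architecture, not the authors' exact implementation: layer sizes follow Table 6, but attention is applied over the full decoder output rather than step by step, and the vocabulary sizes are placeholders.

```python
# Minimal Keras sketch of a GRU encoder-decoder with additive (Bahdanau-style)
# attention, mirroring the sizes in Table 6: 3 GRU layers of 512 units per
# side, 128-dim embeddings, RMSprop optimizer, cross-entropy loss.
import tensorflow as tf
from tensorflow.keras import layers, Model

SRC_VOCAB, TGT_VOCAB, EMB, UNITS = 30000, 30000, 128, 512  # illustrative sizes

enc_in = layers.Input(shape=(None,), dtype="int32", name="source_tokens")
x = layers.Embedding(SRC_VOCAB, EMB, mask_zero=True)(enc_in)
for _ in range(3):                                  # 3-layer GRU encoder
    x = layers.GRU(UNITS, return_sequences=True)(x)
enc_out = x

dec_in = layers.Input(shape=(None,), dtype="int32", name="target_tokens")
y = layers.Embedding(TGT_VOCAB, EMB, mask_zero=True)(dec_in)
for _ in range(3):                                  # 3-layer GRU decoder
    y = layers.GRU(UNITS, return_sequences=True)(y)

# Additive attention: decoder states query the encoder states.
context = layers.AdditiveAttention()([y, enc_out])
y = layers.Concatenate()([y, context])
logits = layers.Dense(TGT_VOCAB, activation="softmax")(y)

model = Model([enc_in, dec_in], logits)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
model.summary()
```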
4.3.3 Transformer-based model

The Transformer is architecturally distinct from the other NMT models because it relies entirely on attention mechanisms, which makes it well suited to capturing long-term dependencies between words in a given text. In this experiment, Transformer-based models are built using the NMT-Keras toolkit, a versatile toolkit based on the Keras library for training deep learning NMT models [40]. For comparison purposes, three different Transformer models are created: Transformer-Big, Transformer-Default, and Transformer-Best Practice, hereafter referred to as Transformer-B, Transformer-D, and Transformer-BP, respectively. These models are trained using different hyper-parameter values but the same training, validation, and testing data sets. Transformer-B and Transformer-D are trained using pre-configured hyper-parameter values, whereas Transformer-BP is trained using tuned hyper-parameter values.

To determine the hyper-parameter values for the Transformer-BP model, several papers that investigate the effect of hyper-parameter values on the translation quality of NMT models were surveyed. In particular, papers focusing on Transformer-based models for low-resource language translation were critically reviewed, and the hyper-parameter values were then chosen with the size of our corpus in mind. The parameter values of the three models are summarized in Table 7.

Table 7: Hyperparameters of the three Transformer models

Hyper-parameter        | Transformer-B | Transformer-D | Transformer-BP
Training set           | 424,230       | 424,230       | 424,230
Validation set         | 10,000        | 10,000        | 10,000
Testing set            | 2,500         | 2,500         | 2,500
Feed-forward dimension | 4096          | 2048          | 2048
BPE size               | 40k           | 37k           | 30k
Attention heads        | 16            | 8             | 4
Dropout                | 0.5           | 0.1           | 0.3
Layers                 | 7             | 6             | 5
Label smoothing        | 0.8           | 0.1           | 0.3
Enc/dec layer drop     | 0.4           | 0.0/0.0       | 0/0.1
Src/tgt word dropout   | 0.3           | 0.0/0.0       | 0.2/0.2
Activation dropout     | 0.5           | 0.0           | 0
Batch size             | 12,288        | 4,096         | 12,288

5 Experimental results

Table 8 presents the BLEU scores of all the models. The BLEU scores in the "augmented corpus" column are cited from previous works for the purpose of comparison and analysis.

Table 8: Experimental results of the different NMT models (BLEU)

Model type     | Augmented corpus (previous works) | Augmented + transliterated corpus (present work)
RNN.att        | 35.38                             | 35.76
GRU            | 37.79                             | 38.22
Transformer-B  | 35.62                             | 35.91
Transformer-D  | 36.53                             | 36.85
Transformer-BP | 39.21                             | 39.67

As the table shows, the BLEU scores of all the models improve when the transliterated and augmented corpus is used. The Transformer-BP and GRU models benefit slightly more from the transliterated corpus than the other models. This is because Transformer-BP is trained with hyperparameters that have been adjusted to account for the size of the corpus, while the GRU model inherently uses a small number of parameters, making it easier to select appropriate hyper-parameter values and achieve better BLEU scores. On the other hand, the hyper-parameter values of the other Transformer-based models (Transformer-B and Transformer-D) are set for larger corpus sizes, so their performance is lower than that of all the remaining models; they benefit the least from the transliterated corpus.

A two-tailed paired t-test is used to determine whether the BLEU scores obtained by the models trained with the transliterated corpus differ significantly from the previous scores. The resulting p-value (0.000301279) is smaller than the significance level (0.05), indicating a significant difference between the two sets of BLEU scores (a sketch of this test is given below). From this, we conclude that transliterating the corpus improves the performance of all three NMT model families. In particular, the BLEU score of the Transformer-BP model is the highest score reported so far for English-Amharic MT.
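The significance check can be reproduced from the numbers in Table 8 alone. The minimal sketch below is our own, using the five BLEU pairs reported above with scipy's paired t-test.

```python
# Two-tailed paired t-test over the five BLEU pairs from Table 8.
from scipy import stats

previous = [35.38, 37.79, 35.62, 36.53, 39.21]   # augmented corpus only
present  = [35.76, 38.22, 35.91, 36.85, 39.67]   # augmented + transliterated

t_stat, p_value = stats.ttest_rel(present, previous)
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")     # p is about 0.0003 < 0.05
```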
6 Conclusion

Low-resource MT is still a work in progress and in its early stages for a variety of reasons. In contrast, MT research for resource-rich languages has come a long way in the acquisition of resources and the creation of different MT architectures. As a result, several successful NMT architectures have been introduced, including RNNs, GRUs, and, most importantly, the Transformer. However, due to resource constraints (particularly the lack of large bilingual corpora), most languages in the low-resource category do not benefit from these successful architectures. Amharic is one of these languages. In this work, we therefore decided to take up this challenge and attempted to improve the performance of English-Amharic MT using corpus transliteration and augmentation.

To that end, we created the largest Amharic-English transliteration corpus to date from the previously collected English-Amharic parallel corpus, using Google Translate (for transliteration) and human data collectors (for normalization). In the normalization process, transliterated names and borrowed words are spelled as closely to their English translations as possible. After this, a token-level corpus augmentation technique was applied to the transliterated corpus in order to artificially increase its size. By doing so, we were able to create a corpus (transliterated and augmented) with a size of 450,608 parallel sentences.

With the created data set, RNN-with-attention, GRU-based, and Transformer-based NMT architectures were trained. Compared to our previous work, in which we used corpus augmentation with similar training parameters, all three model families in this study achieve better MT performance. In particular, the BLEU score achieved by one of the three Transformer models (Transformer-BP) is, to the best of our knowledge, the state-of-the-art result (39.67 BLEU) for this language pair so far, and transliteration played a part in this.

Overall, this work adds two contributions to the knowledge base of English-Amharic MT research: the creation of an English-Amharic transliteration and augmentation corpus, and the improvement of English-Amharic MT performance.

References

[1] Kirkpatrick, Andy. English as a lingua franca in ASEAN: A multilingual model. Vol. 1. Hong Kong University Press, 2010. https://doi.org/10.5790/hongkong/9789888028795.003.0008

[2] Kramsch, Claire. "Teaching foreign languages in an era of globalization: Introduction." The Modern Language Journal 98.1 (2014): 296-311. https://doi.org/10.1111/j.1540-4781.2014.12057.x

[3] Coulmas, Florian. Sociolinguistics: The study of speakers' choices. Cambridge University Press, 2013.

[4] Kumaran, Adimugan, and Tobias Kellner. "A generic framework for machine transliteration." Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2007. https://doi.org/10.1145/1277741.1277876
[5] Zhou, Dong, et al. "Translation techniques in cross-language information retrieval." ACM Computing Surveys (CSUR) 45.1 (2012): 1-44.

[6] Alkhatib, Manar, and Khaled Shaalan. "The key challenges for Arabic machine translation." Intelligent Natural Language Processing: Trends and Applications. Springer, Cham, 2018. 139-156. https://doi.org/10.1007/978-3-319-67056-0_8

[7] Thanh, Thao Phan Thi. Machine translation of proper names from English and French into Vietnamese: an error analysis and some proposed solutions. Diss. Université de Franche-Comté, 2014.

[8] Guzmán, Francisco, et al. "The FLORES evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English." arXiv preprint arXiv:1902.01382 (2019).

[9] Aqlan, Fares, et al. "Improved Arabic-Chinese machine translation with linguistic input features." Future Internet 11.1 (2019): 22. https://doi.org/10.3390/fi11010022

[10] Harrat, Salima, Karima Meftouh, and Kamel Smaili. "Machine translation for Arabic dialects (survey)." Information Processing & Management 56.2 (2019): 262-273. https://doi.org/10.1016/j.ipm.2017.08.003

[11] Finch, Andrew, et al. "A Bayesian model of transliteration and its human evaluation when integrated into a machine translation system." IEICE Transactions on Information and Systems 94.10 (2011): 1889-1900. https://doi.org/10.1587/transinf.e94.d.1889

[12] Salawu, Abiodun, and Asemahagn Aseres. "Language policy, ideologies, power and the Ethiopian media." Communicatio 41.1 (2015): 71-89. https://doi.org/10.1080/02500167.2015.1018288

[13] Menuta, Fekede. "Over-differentiation in Amharic orthography and attitude towards reform." The Ethiopian Journal of Social Sciences and Language Studies (EJSSLS) 3.1 (2016): 3-32.

[14] Kaur, Kamaljeet, and Parminder Singh. "Review of machine transliteration techniques." International Journal of Computer Applications 107.20 (2014).

[15] Karimi, Sarvnaz, Falk Scholer, and Andrew Turpin. "Machine transliteration survey." ACM Computing Surveys (CSUR) 43.3 (2011): 1-46. https://doi.org/10.1145/1922649.1922654

[16] Le, Ngoc Tan, and Fatiha Sadat. "Low-resource machine transliteration using recurrent neural networks of Asian languages." Proceedings of the Seventh Named Entities Workshop. 2018. https://doi.org/10.18653/v1/w18-2414

[17] Deep, Kamal, and Vishal Goyal. "Development of a Punjabi to English transliteration system." International Journal of Computer Science and Communication 2.2 (2011): 521-526.

[18] Goyal, Vishal, and Gurpreet Singh Lehal. "Hindi-Punjabi Machine Transliteration System (For Machine Translation System)." George Ronchi Foundation Journal, Italy 64.1 (2009). https://doi.org/10.4304/jetwi.2.2.148-151

[19] Duan, Xiangyu, et al. "Report of NEWS 2016 machine transliteration shared task." Proceedings of the Sixth Named Entity Workshop. 2016. https://doi.org/10.18653/v1/w16-2709

[20] Finch, Andrew, and Eiichiro Sumita. "Transliteration using a phrase-based statistical machine translation system to re-score the output of a joint multigram model." Proceedings of the 2010 Named Entities Workshop. 2010. https://doi.org/10.3115/1699705.1699719

[21] Ngo, Hoang Gia, et al. "Phonology-augmented statistical transliteration for low-resource languages." Sixteenth Annual Conference of the International Speech Communication Association. 2015.

[22] Laurent, Antoine, Paul Deléglise, and Sylvain Meignier.
"Grapheme to phoneme conversion using an SMT system." 10th Annual Conference of the International Speech Communication Association (INTERSPEECH 2009). 2009. https://doi.org/10.21437/interspeech.2009-243

[23] Finch, Andrew, et al. "Target-bidirectional neural models for machine transliteration." Proceedings of the Sixth Named Entity Workshop. 2016. https://doi.org/10.18653/v1/w16-2711

[24] Shao, Yan, and Joakim Nivre. "Applying neural networks to English-Chinese named entity transliteration." Proceedings of the Sixth Named Entity Workshop. 2016. https://doi.org/10.18653/v1/w16-2710

[25] Thu, Ye Kyaw, et al. "Comparison of grapheme-to-phoneme conversion methods on a Myanmar pronunciation dictionary." Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP 2016). 2016.

[26] Yao, Kaisheng, and Geoffrey Zweig. "Sequence-to-sequence neural net models for grapheme-to-phoneme conversion." arXiv preprint arXiv:1506.00196 (2015). https://doi.org/10.21437/interspeech.2015-134

[27] Rao, Kanishka, et al. "Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks." 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015. https://doi.org/10.1109/icassp.2015.7178767

[28] Tedla, Tadele. "amLite: Amharic Transliteration Using Key Map Dictionary." arXiv preprint arXiv:1509.04811 (2015).

[29] Gezmu, Andargachew Mekonnen, Andreas Nürnberger, and Tesfaye Bayu Bati. "Neural Machine Translation for Amharic-English Translation." ICAART (1). 2021. https://doi.org/10.5220/0010383905260532

[30] Biadgligne, Yohannes, and Kamel Smaïli. "Parallel Corpora Preparation for English-Amharic Machine Translation." International Work-Conference on Artificial Neural Networks. Springer, Cham, 2021. https://doi.org/10.1007/978-3-030-85030-2_37

[31] Biadgligne, Yohannes, and Kamel Smaïli. "Offline Corpus Augmentation for English-Amharic Machine Translation." ICICT CPS conference proceedings. 2022. https://doi.org/10.1109/icict55905.2022.00030

[32] Yuliani, S. Y., et al. "Hoax news validation using similarity algorithms." Journal of Physics: Conference Series. Vol. 1524. No. 1. IOP Publishing, 2020.

[33] United States Board on Geographic Names, and United States Defense Mapping Agency. Gazetteer of Ethiopia: Names Approved by the United States Board on Geographic Names. Defense Mapping Agency, 1982. https://doi.org/10.5962/bhl.title.39085

[34] Klein, Guillaume, et al. "OpenNMT: Open-source toolkit for neural machine translation." arXiv preprint arXiv:1701.02810 (2017).

[35] Andrabi, Syed Abdul Basit, and Abdul Wahid. "A Comprehensive Study of Machine Translation Tools and Evaluation Metrics." Inventive Systems and Control. Springer, Singapore, 2021. 851-865. https://doi.org/10.1007/978-981-16-1395-1_62

[36] Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Neural machine translation of rare words with subword units." arXiv preprint arXiv:1508.07909 (2015). https://doi.org/10.18653/v1/p16-1162

[37] McCandlish, Sam, et al. "An empirical model of large-batch training." arXiv preprint arXiv:1812.06162 (2018).

[38] Yao, Zhewei, et al. "Large batch size training of neural networks with adversarial training and second-order information." arXiv preprint arXiv:1810.01021 (2018).

[39] Alom, Md Zahangir, et al.
"A state-of-the-art survey on deep learning theory and architectures." Electronics 8.3 (2019): 292.

[40] Peris, Álvaro, and Francisco Casacuberta. "NMT-Keras: a very flexible toolkit with a focus on interactive NMT and online learning." arXiv preprint arXiv:1807.03096 (2018). https://doi.org/10.2478/pralin-2018-0010