https://doi.org/10.31449/inf.v44i2.2559 Informatica 44 (2020) 147–165

Dialogue Act-Based Expressive Speech Synthesis in Limited Domain for the Czech Language

Martin Grůber, Jindřich Matoušek, Zdeněk Hanzlíček and Daniel Tihelka
University of West Bohemia, Faculty of Applied Sciences
NTIS – New Technologies for the Information Society, Department of Cybernetics
Univerzitní 8, Pilsen, Czech Republic
E-mail: gruber@ntis.zcu.cz

Keywords: speech synthesis, unit selection, HMM, expressivity, dialogue act, limited domain

Received: October 30, 2018

This paper deals with expressive speech synthesis in a dialogue. Dialogue acts – discrete expressive categories – are used to describe expressivity. The aim of the work is to create a procedure for developing expressive speech synthesis for a dialogue system in a limited domain. The domain is here limited to dialogues between a human and a computer on a given topic of reminiscing about personal photographs. To incorporate expressivity into synthetic speech, modifications of current algorithms used for neutral speech synthesis are made. An expressive speech corpus is recorded, annotated using a predefined set of dialogue acts, and its acoustic analysis is performed. Unit selection and HMM-based methods are used to synthesize expressive speech, and an evaluation using listening tests is presented. The listeners assess two basic aspects of synthetic expressive speech for isolated utterances: speech quality and expressivity perception. The evaluation is also performed for utterances in a dialogue to assess the appropriateness of synthetic expressive speech. It can be concluded that synthetic expressive speech is rated positively even though it is of worse quality when compared with neutral speech synthesis. However, synthetic expressive speech is able to transmit expressivity to listeners and to improve the naturalness of the synthetic speech.

Povzetek: A method for expressive speech synthesis in Czech has been developed.

1 Introduction

Nowadays, speech synthesis techniques produce high-quality and intelligible speech. However, to use synthetic speech in dialogue systems (ticket booking [1], information on restaurants or hotels [2], flights [3], trains [4] or weather [5]) or in any other human-computer interactive systems (virtual computer companions, computer games), the voice interface should be friendlier to make the user feel more involved in the interaction or communication. Synthetic speech cannot sound completely natural until it expresses a speaker's attitude. Thus, expressive (or emotional) speech synthesis is a frequently discussed topic and has become a concern of many scientists. Even though some results have already been presented, this task has not been satisfactorily solved yet. Some papers which deal with this problem include, but are not limited to, [6, 7, 8, 9, 10, 11, 12].

To reduce the complexity of general expressive speech synthesis, the task is usually somehow limited (as limited domain speech synthesis systems are) and focused on a specific domain, e.g. expressive football announcements [13], sport commentaries [14] or a dialogue system in a tourism domain [15]. In this work, we limited the domain to conversations between seniors and a computer. Personal photographs were chosen as the topic for these discussions since the work started as a part of a major project aiming at developing a virtual senior companion with an audiovisual interface [16].
Once the specific limited domain is defined, the task of expressive speech synthesis becomes more easily solvable. However, this work tries to propose a general methodology for designing an expressive speech synthesizer in a limited domain. Thus, it should be possible to create a synthe- sizer for various limited domains following the procedure described herein. In the first phase of our research, becoming acquainted with the defined domain was the main goal. Thus, an extensive audiovisual database containing 65 natural di- alogues between humans (seniors) and a computer (rep- resented by a 3D virtual avatar) was created using the Wizard-of-Oz method which was proposed in [17] and used e.g. in [18, 19]. Afterwards, the dialogues were manually transcribed so that the text could be used later. The process of the database recording is described in Section 2. Next, on the basis of these dialogues (the texts and the audio recordings), an expressive speech corpus was de- signed and recorded. The recording of the expressive cor- pus was performed in the form of a dialogue between a professional female voice talent and a computer. The di- 148 Informatica 44 (2020) 147–165 M. Gr˚ uber et al. alogues were designed on the basis of the natural dia- logues recorded in the previous phase. Thus, the voice talent (acting as the virtual avatar now) was recording pre- defined sentences as responses to the seniors’ speech that the voice talent was listening to. The expressive speech cor- pus recording process is in more details described in Sec- tion 3. To synthesize expressive speech, an expressivity descrip- tion has to be defined. Many approaches have been sug- gested in the past. Continuous descriptions using multidi- mensional space with several axes to determinate “expres- sivity position” were described e.g. in [20, 21]. Another option is a discrete division into various groups, for emo- tions e.g. happiness, sadness, anger, joy, etc. [22]. The dis- crete description is the most commonly used method and various sets of expressive categories are used, e.g. dialogue acts [23, 15], emotion categories [24, 7, 25] or categories like good news and bad news [8, 26]. Thus, a set of ex- pressive categories was defined and used to annotate the expressive speech corpus. The expressive categories used in our work are presented in Section 4 and annotation of the expressive speech corpus is described in Section 5. There are various methods to produce synthetic speech, the mostly used are unit selection [27], HMM-based meth- ods [28], DNN-based methods [29] or other methods based on neural networks [30, 31]. These methods can be cer- tainly used also for the expressive speech synthesis. In ad- dition, a method for voice conversion [32] can be taken into consideration. Although this method is primarily used for a conversion of source voice to a target voice in the pro- cess of the speech synthesis, it can be also used to convert one speaking style to another [33]. DNN-based approaches then allows e.g. adaptation of an expressive model to a new speaker [34]. To incorporate expressivity into speech using unit selec- tion method, the baseline algorithm used e.g. in [35, 36] was slightly modified. The main modification consists in a different target cost calculation. A prosodic feature repre- senting an expressive category is considered in addition to the current set of features used for the cost calculation. 
To get specific penalties for speech units labelled with an expressive category different from the requested one, enumerated differences between various expressive categories are used. To compute the penalties, a penalty matrix based on perceptual and acoustic differences is used. The complex acoustic analysis of the expressive speech corpus, along with the unit selection method modifications, is described in Section 6.

Even though this work is mainly focused on using the unit selection method for expressive speech synthesis, a brief description of preliminary experiments with an HMM-based method is also presented. The HMM-based TTS system settings are described in Section 7.

The results and evaluation are presented in Section 8. The expressivity perception ratio is investigated for natural speech and for synthetic speech generated by both the unit selection based TTS system and the HMM-based TTS system. The synthetic speech quality is also discussed in that section. As the results of this work are to be used in a dialogue system, the suitability of the produced expressive synthetic speech is also evaluated directly in dialogues.

2 Natural dialogues

To become acquainted with the limited domain, an extensive audiovisual database¹ of natural dialogues was created using the Wizard-of-Oz method. This means that each dialogue was recorded as a dialogue between a human (senior) and a computer (avatar) which was allegedly controlled only by the human voice. However, the computer was covertly controlled by human operators from another room. Thus, the operators were controlling the course of the dialogue whereas the recorded human subjects thought they were interacting with an independent system based on artificial intelligence. The avatar was using the neutral TTS system ARTIC [35] to speak to the human subjects. The recording procedure is described in more detail in [37].

¹ The video recordings are not used for the purposes of the expressive TTS system design. They were just archived and are intended for future use in audiovisual speech recognition, emotion detection, gesture recognition, etc.

2.1 Recording setup

A soundproof recording room was established for the recording purposes (the scheme is shown in Figure 1). In the recording room, the human subject faces an LCD screen and two speakers. The speech is recorded by two wireless high-quality head microphones (one for the human subject and one for the computer avatar), and the video is captured by three miniDV cameras. A surveillance web camera was placed in the room to monitor the situation, especially the senior's state. The only contact between a user and the computer was through speech; there was no keyboard or mouse on the table.

Figure 1: Recording room setup.

A snapshot captured by the miniDV cameras during a recording session is presented in Figure 2. The cameras were positioned to capture the subject from three different views to provide data usable in various ways.

Figure 2: Screenshot captured by the miniDV cameras during a recording session.

2.2 Recording application description

A snapshot of the screen presented to human subjects is shown in Figure 3 ("Presenter" interface). On the upper left part of the LCD screen, a 3D model of a talking head is visualized.
This model is used as the avatar, the im- personate companion that should play a role of the partner in the dialogue. Additionally, on the right upper part, there is shown a photograph which is currently being discussed. On the lower half of the screen, there is a place used for dis- playing subtitles (just in case the synthesized speech is not intelligible sufficiently). The subtitles were displayed only during the first few sessions and then they were switched off as the generated speech turned out to be understandable enough. Figure 3: Snapshot of the WoZ system interface - the user’s side. In Figure 4, a screen of the operator’s part of the record- ing application is shown (“Wizard” interface). The inter- face provides the human operators with possibilities of dia- logue flow controlling. The middle part of the screen serves to display the pre-prepared scenario for a dialogue. Note that the wizards could select the sentences from the sce- nario, the assumption on how the dialogue could develop, by clicking on them. Each sentence of the scenario was given a number related to the picture displayed on the left. This enabled the orientation in large pre-prepared scenar- ios. Under the picture there is a button for displaying the picture on the “Presenter” screen. Once a sentence is se- lected by clicking on the list, it appears in the bottom edit box just above the buttons “SPEAK” and “clear”. The dis- played sentence can be modified before pressing “SPEAK” button and also an arbitrary text can be typed into the edit box. The right part of the screen is intended for displaying buttons bearing non-speech acts (smile, laughter, assenta- tion, hesitation) and quick phrases (Yes. No. It’s nice. Al- right. Doesn’t matter. Go on; etc.). Figure 4: Snapshot of the WoZ system interface - the oper- ator’s side. 2.3 Audiovisual database statistics Almost all audio recordings are stored using 22kHz sam- ple rate and 16-bit resolution. The first six dialogues were recorded using 48kHz sample rate, later it was reduced to the current level according to requirements of the cooperat- ing team dealing with ASR (automatic speech recognition). The total number of recorded dialogues is 65. Based on gender, the set of speakers can be divided into 37 females and 28 males. Mean age of the speakers is 69.3 years; this number is almost the same for both male and female speakers. The oldest person was a female, 86 years old. The youngest one was also a female, 54 years old. All the recorded subjects were native Czech speakers; two of them (1 male and 1 female) spoke a regional Moravian dialect. This dialect differs from regular Czech language in pronun- ciation and also a little in vocabulary. Approximately one half of the subjects stated in the after recording form that they have a computer at home. Nevertheless, most of them do not use it very often. Almost all the dialogues were rated as friendly and smooth. And even more, the users were re- ally enjoying reminiscing on their photos, no matter that the partner in the dialogue was an avatar. Duration of each dialogue was limited to 1 hour, as this was the capacity of 150 Informatica 44 (2020) 147–165 M. Gr˚ uber et al. tapes used in miniDV cameras, resulting in average dura- tion 56 minutes per dialogue. During the conversation, 8 photographs were discussed in average (maximum was 12, minimum 3). 3 Expressive corpus recording 3.1 Texts preparation For developing a high-quality expressive speech synthe- sis system, an expressive speech corpus has to be created. 
Such a corpus can be then merged or just enhanced by a neutral one to create a robust corpus containing neutral speech as well as expressivity while keeping a maximum speech units coverage (phonetic balance). The process of designing texts for the expressive corpus recording is very important. The real natural dialogues and their transcrip- tions were taken as a basis for such a design. Thus, almost all the texts (more than 7000 sentences) uttered by the com- puter avatar during the natural dialogues were used. Texts containing unfinished phrases due to e.g. speakers overlap- ping were omitted. These texts form a set of sentences to be recorded. 3.2 Recording process For the expressive corpus recording, a method using so- called scenarios was applied. A scenario in our case can be viewed as a natural dialogue whose course is prepared in advance, just with missing audio of one of the partic- ipants (the avatar). This means that the parts of the dia- logues to be uttered by a voice talent represent the com- puter avatar responses and order of these parts is fixed. The parts also follow the natural dialogues and are accompa- nied with the other participant’s original speech to provide the voice talent with information about the context. Actu- ally, the recording was a simulation of the natural dialogues where the voice talent was standing for the computer avatar and was pronouncing its sentences. This should stimu- late the voice talent to became naturally expressive while recording. As the voice talent, a female professional stage-player experienced in speech corpora recording was chosen. The voice talent had already recorded the neutral speech corpus for our neutral TTS system. This corresponds with the in- tention suggested in Section 3.1 that the expressive corpus should be enhanced by the neutral one to keep the speech units coverage. To improve the performance of tools pro- cessing the recorded speech corpora, glottal signal was cap- tured along with the speech signal during the recording. 3.3 Recording application description To record the expressive corpus using the above described method, a special recording application was developed. The application interface is depicted in Figure 5. Figure 5: Interface of the application for expressive corpus recording. On the upper part of the application window, the text to be recorded is displayed. However, the voice talent was allowed to change the exact sentence wording if unclear 2 while keeping the same meaning. On the middle part, there are, among other things, control buttons for recording and listening. On the bottom, the waveform of the just recorded sentence is shown. The application can be also controlled via keyboard short-cuts to make it more comfortable for the voice talent. 4 Expressivity description To incorporate expressivity in synthetic speech, some kind of its description is necessary. A general description of expressivity is a very complex task that has not been sat- isfactorily solved yet even though there are some studies (e.g. [38]) dealing with this topic. For various research fields and their tasks, there are various possibilities of ex- pressivity description. In our work, a description making use of so-called dialogue acts was used. It is a categori- cal description based on a classification of expressivity into pre-defined classes (used also in [39, 23, 15]). 
Although there are several schemas describing expressivity using dialogue acts (including DAMSL [40, 41], SWBD-DAMSL [42], VERBMOBIL [43, 44] or the AT&T schema [39]), a new schema was employed to describe expressivity in our limited domain in question. The set of proposed dialogue acts is shown in Table 1 along with a few examples.

The definition of the dialogue acts was based on the audiovisual database of the natural dialogues (described in Section 2) and on the expressive speech corpus (described in Section 3). These dialogue acts are then used for expressive corpus annotation (Section 5) and also in the process of the expressive speech synthesis (Section 6).

² Since the texts for the recording were prepared automatically and were not manually checked due to their high number, they could contain some typos, unintelligibilities or unclarities.

dialogue act | example
directive | Tell me that. Talk.
request | Let's get back to that later.
wait | Wait a minute. Just a moment.
apology | I'm sorry. Excuse me.
greeting | Hello. Good morning.
goodbye | Goodbye. See you later.
thanks | Thank you. Thanks.
surprise | Do you really have 10 siblings?
sad empathy | I'm sorry to hear that. It's really terrible.
happy empathy | It's nice. Great. It had to be wonderful.
showing interest | Can you tell me more about it?
confirmation | Yes. Yeah. I see. Well. Hmm.
disconfirmation | No. I don't understand.
encouragement | Well. For example? And what about you?
not specified | Do you hear me well? My name is Paul.

Table 1: Set of dialogue acts.

The need for a new dialogue act schema was driven by the definition of our specific limited domain. Most of the dialogue acts are intended to encourage the (human) partner in a dialogue to talk more about a topic, while the computer dialogue system behaves more like a patient listener.

Even though dialogue act schemas are generally supposed to describe various phases of dialogues, we assume that in various dialogue phases a speaker can present his state of mind, mood or personal attitude in a specific way. We believe that the proposed set of dialogue acts can be used not only for the description of various dialogue phases but that it also represents the speaker's attitude and affective state expressed by expressive speech. Using these dialogue acts in this limited domain, the synthetic speech is supposed to become more natural for the listeners (seniors in this case).

5 Expressive corpus annotation

The expressive speech corpus was annotated with dialogue acts using a listening test. The test was aimed at determining an objective annotation on the basis of several subjective annotations, as the perception of expressivity is always subjective and may vary depending on a particular listener. The preparatory work, the listening test framework, the evaluation of the listening test results and an analysis of inter-rater agreement are presented in the following paragraphs.

5.1 Listening test background

The listening test was organized on a client-server basis using a specially developed web application. This way, listeners were able to work on the test from their homes without any contact with the test organizers. The listeners were required to have only an internet connection, any browser installed on their computers and some device for audio playback. Various measures were undertaken to detect possible cheating, carelessness or misunderstandings.
Potential test participants were addressed mostly among university students from all faculties and the finished lis- tening test was financially rewarded (to increase motivation for the listeners). The participants were instructed to listen to the recordings very carefully and subsequently mark dia- logue acts that are expressed within the sentence. The num- ber of possibly marked dialogue acts for one utterance was just upon the listeners, they were not limited anyhow. Few sample sentences labelled with dialogue acts were provided and available to the listeners on view at every turn. If any listener marked one utterance with more than one dialogue act, he was also required to specify whether the functions occur in that sentence consecutively or concurrently. If the dialogue acts are marked as consecutive in a particular ut- terance, this utterance is omitted from further research for now. These sentences should be manually reviewed later and either divided into more shorter sentences or omitted completely. Finally, 12 listeners successfully finished the listening test. However, this way we obtained subjective annotations that vary across the listeners. To objectively annotate the expressive recordings, a proper combination of the subjec- tive annotations was needed. Therefore an evaluation of the listening test was made. 5.2 Objective annotation We utilized two ways to deduce the objective annotation. The first way is a simple majority method. Using this easy and intuitive approach, each sentence is assigned a di- alogue act that was marked by the majority of the listeners. In case of less then 50% of all listeners marked any dia- logue act, the classification of this sentence is considered as untrustworthy. The second approach is based on maximum likelihood method. Maximum likelihood estimation is a statistical method used for fitting a statistical model to data and pro- viding estimates for the model’s parameters. Under cer- tain conditions, the maximum likelihood estimator is con- sistent. The consistency means that having a sufficiently large number of observations (annotations in our case), it is possible to find the value of statistical model parame- ters with arbitrary precision. The parameter calculation is implemented using the EM algorithm [45]. Knowing the model parameters, it is possible to deduce true observation which is called objective annotation. Precision of the es- timate is one of the outputs of this model. Using the pre- cision, any untrustworthy assignment of a sentence with 152 Informatica 44 (2020) 147–165 M. Gr˚ uber et al. a dialogue act can be eliminated. Comparing these two approaches, 35 out of 7287 clas- sifications were marked as untrustworthy using maximum likelihood method and 571 using simple majority method. The average ratio of listeners who marked the same dia- logue act for particular sentence using simple majority ap- proach was 81%, when untrustworthy classifications were excluded. Similar measure for maximum likelihood ap- proach cannot be easily computed as the model parameters and the estimate precision depend on number of iteration in the EM algorithm. We decided to use the objective annotation obtained by maximum likelihood method. It is an asymptotically con- sistent, asymptotically normal and asymptotically efficient estimate. This approach was also successfully used in other works regarding speech synthesis research, see [46]. 
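A minimal sketch of the simple majority method described above, assuming the subjective annotations are available as a mapping from utterance identifier to the list of dialogue act labels chosen by the individual listeners; the function name and data layout are illustrative only, and the maximum likelihood (EM) variant actually adopted is not reproduced here.

```python
from collections import Counter

def majority_annotation(listener_labels, n_listeners):
    """listener_labels: dict mapping utterance id -> list of DA labels,
    one label per listener who marked that utterance.
    Returns utterance id -> DA label, or None if untrustworthy."""
    objective = {}
    for utt, labels in listener_labels.items():
        da, votes = Counter(labels).most_common(1)[0]
        # A classification is trusted only if at least half of all listeners
        # marked the same dialogue act (Section 5.2); otherwise it is dropped.
        objective[utt] = da if votes >= 0.5 * n_listeners else None
    return objective
```

The maximum likelihood alternative replaces the raw vote count with listener-specific reliability parameters estimated by the EM algorithm, which is why it discards far fewer sentences (35 vs. 571 untrustworthy classifications reported in Section 5.2).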
Further, we need to confirm that the listeners marked the sentences with dialogue acts consistently and achieved some measure of agreement. Otherwise the subjective an- notations could be considered as accidental or the dialogue acts inappropriately defined and thus the acquired objective annotation would be false. For this purpose, we make use of two statistical measures for assessing the reliability of agreement among listeners. One of the measures used for such evaluation is Fleiss’ kappa [47, 48]. It is a statistical measure for assessing the reliability of agreement between a fixed number of raters when assigning categorical ratings to a number of items or classifying items. We calculated this measure among all listeners separately for each dialogue act. Computation of overall Fleiss’ kappa is impossible because the listeners were allowed to mark more than one dialogue act for each sentence. However, the overall value can be evaluated as the mean of Fleiss’ kappas of all dialogue acts. Another measure used here is Cohen’s kappa [49, 48]. It is a statistical measure of inter-rater agreement for cate- gorical items and takes into account the agreement occur- ring by chance as well as Fleiss’ kappa. However, Cohen’s kappa measures the agreement only between two listeners. We decided to measure the agreement between each lis- tener and the objective annotation obtained by maximum likelihood method. Again, calculation of Cohen’s kappa was made for each dialogue act separately. Thus, we can find out whether particular listener was in agreement with the objective annotation for certain dialogue act. Finally, the mean of Cohen’s kappas of all dialogue acts can be cal- culated. Results of agreement measures are presented in Table 2. Values of Fleiss’ and Cohen’s kappas vary between 0 and 1, the higher value the better agreement. More detailed inter- pretation of measure of agreement is e.g in [50]. The Fleiss’ kappa mean value of 0.5434 means that the measure of inter-listeners agreement is moderate. As it is obvious from Table 2, dialogue acts OTHER and NOT- SPECIFIED should be considered as poorly recognizable. It is understandable when taking into consideration their definitions. After eliminating values of these dialogue acts the mean value of 0.6191 is achieved, which means sub- stantial agreement among the listeners. The Cohen’s kappa mean value of 0.6632 means that the measure of agreement between listeners and objective annotation is substantial. Moreover, we can again elimi- nate dialogue acts OTHER and NOT-SPECIFIED as they were poorly recognizable also according to Cohen’s kappa. Thus, mean value of 0.7316 is achieved. However, it is still classified as a substantial agreement. As it is shown in Table 2, agreement among listeners re- garding classification of consecutive dialogue act was mea- sured too. The listeners agreed on this label moderately among each other and substantially with the objective an- notation. There are also shown ratios of the particular di- alogue acts occurrence when maximum likelihood method was used for the objective annotation obtaining. It is ob- vious that dialogue actsSHOW-INTEREST andENCOUR- AGE are the most frequent. 6 Unit selection 6.1 General unit selection approach In general, a unit selection algorithm (for our system de- scribed e.g. in [51]) is used to form resulting synthetic speech from speech units that are selected from a list of corresponding candidate units. 
These candidates are stored in a unit inventory which is built on the basis of a speech corpus. The unit selection process usually takes into account two different groups of candidate features.

6.1.1 Concatenation cost

Features in one group are used for the concatenation cost computation. This cost reflects continuity distortion, i.e. how smoothly each candidate for unit u_{i-1} will join with each candidate for unit u_i in the sequence. The lower the cost, the less noticeable the unit boundaries are. This group of features usually includes mostly ordinal values (acoustic and spectral parameters of the speech signal), e.g. some acoustic coefficients, energy values, F0 values, their differences, etc. The concatenation cost for candidate u_i is then calculated as follows:

C_i = \frac{\sum_{j=1}^{n} w_j d_j}{\sum_{j=1}^{n} w_j},   (1)

where C_i is the concatenation cost of a candidate for unit u_i, n is the number of features under consideration, w_j is the weight of the j-th feature and d_j is an enumerated difference between corresponding features of two potentially adjacent candidates for units u_{i-1} and u_i – for unit u_i, the features from the end of the originally preceding (adjacent in the original corpus) unit are compared with the same features from the end of unit u_{i-1}.

dialogue act | Fleiss' kappa | measure of agreement | Cohen's kappa (mean) | Cohen's kappa (SD) | measure of agreement | occurrence probability
DIRECTIVE | 0.7282 | Substantial | 0.8457 | 0.1308 | Almost perfect | 0.0236
REQUEST | 0.5719 | Moderate | 0.7280 | 0.1638 | Substantial | 0.0436
WAIT | 0.5304 | Moderate | 0.7015 | 0.4190 | Substantial | 0.0073
APOLOGY | 0.6047 | Substantial | 0.7128 | 0.2321 | Substantial | 0.0059
GREETING | 0.7835 | Substantial | 0.8675 | 0.1287 | Almost perfect | 0.0137
GOODBYE | 0.7408 | Substantial | 0.7254 | 0.1365 | Substantial | 0.0164
THANKS | 0.8285 | Almost perfect | 0.8941 | 0.1352 | Almost perfect | 0.0073
SURPRISE | 0.2477 | Fair | 0.4064 | 0.1518 | Moderate | 0.0419
SAD-EMPATHY | 0.6746 | Substantial | 0.7663 | 0.0590 | Substantial | 0.0344
HAPPY-EMPATHY | 0.6525 | Substantial | 0.7416 | 0.1637 | Substantial | 0.0862
SHOW-INTEREST | 0.4485 | Moderate | 0.6315 | 0.3656 | Substantial | 0.3488
CONFIRM | 0.8444 | Almost perfect | 0.9148 | 0.0969 | Almost perfect | 0.1319
DISCONFIRM | 0.4928 | Moderate | 0.7153 | 0.1660 | Substantial | 0.0023
ENCOURAGE | 0.3739 | Fair | 0.5914 | 0.3670 | Moderate | 0.2936
NOT-SPECIFIED | 0.1495 | Slight | 0.3295 | 0.2292 | Fair | 0.0736
OTHER | 0.0220 | Slight | 0.0391 | 0.0595 | Slight | 0.0001
mean | 0.5434 | Moderate | 0.6632 | | Substantial |
consecutive DA | 0.5138 | Moderate | 0.6570 | 0.2443 | Substantial | 0.0374

Table 2: Fleiss' and Cohen's kappa and occurrence ratio for various dialogue acts and for the "consecutive DAs" label. For Cohen's kappa, the mean value and standard deviation are presented, since Cohen's kappa is measured between the annotation of each listener and the reference annotation.

6.1.2 Target cost

Features in the other group are used for the target cost computation. This cost reflects the level of approximation of a target unit by any of the candidates; in other words, how well a candidate from the unit inventory fits a corresponding target unit – a theoretical unit whose features are specified on the basis of the sentence to be synthesized. This group usually includes mostly nominal features, e.g. phonetic context, prosodic context, position in word, position in sentence, position in syllable, etc.
The target cost for candidate u_i is then calculated as follows:

T_i = \frac{\sum_{j=1}^{n} w_j d_j}{\sum_{j=1}^{n} w_j},   (2)

where T_i is the target cost of a candidate for unit u_i, n is the number of features under consideration, w_j is the weight of the j-th feature and d_j is an enumerated difference between the j-th feature of a candidate for unit u_i and the target unit t_i. The differences of particular features (d_j) can also be referred to as penalties.

For our ARTIC TTS system, the features that are considered when calculating the target cost are shown in Table 3.

feature | weight
position in a prosodic word | 7.0
left phoneme context | 3.0
right phoneme context | 3.0
prosodeme type | 14.0
voicing – at the beginning | 8.5
voicing – at the end | 8.5

Table 3: Prosodic features along with their weights used for target cost calculation in the ARTIC TTS system.

6.2 Basic target cost for expressive speech synthesis

When using the expressive speech corpus, the set of features used for the target cost computation is extended with one more feature. Following the aforementioned expressivity description, it is called dialogue act. The penalty d_{da} between a candidate u_i and a target unit t_i can, in the simplest way, be calculated as follows:

d_{da} = \begin{cases} 0 & \text{if } da_t = da_c \\ 1 & \text{otherwise,} \end{cases}   (3)

where d_{da} is a difference (penalty), da_t is the dialogue act of the target unit t_i and da_c is the dialogue act of the candidate u_i. Finally, a weight for this penalty needs to be set, since the target cost is calculated as a weighted sum of particular penalties.

6.3 Advanced target cost for expressive speech synthesis

The target cost calculation presented in Equation (3) is very simple and assumes that penalties between different expressive categories (represented by the dialogue acts) are all the same. However, this is not true in most cases. For instance, the difference between SAD-EMPATHY and HAPPY-EMPATHY should probably be greater than the difference between SAD-EMPATHY and NEUTRAL – this means that when synthesizing a sentence in the SAD-EMPATHY manner and there is no available or suitable candidate labelled with this dialogue act, it is probably better to consider a candidate labelled with the NEUTRAL dialogue act than a candidate labelled as HAPPY-EMPATHY. Therefore, it is necessary to enumerate the differences between various dialogue acts and use them for the target cost calculation. The basics of the procedure are described in [52]; a slightly enhanced version is presented here.

6.3.1 General penalty matrix

The differences are assumed to be coded in a penalty matrix M, where coefficient m_{ij} represents a difference (a penalty) between a dialogue act i and a dialogue act j. To determine the coefficients of the matrix, i.e. the differences between dialogue acts, two aspects should be considered: human perception of the speech and acoustic measures calculated from the signal. Thus, two separate matrices are created and then combined. Coefficients of the first matrix P are calculated on the basis of the listening test that was performed to annotate the dialogue acts in the expressive speech corpus [37] (see Section 6.3.2). The second matrix A is based on the results of an acoustic analysis of expressive speech [53] (see Section 6.3.3). The combined final penalty matrix M represents the overall differences (penalties) between various dialogue acts.
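Before detailing how the matrix M is obtained, the way it enters the selection criterion can be summarized in a short sketch. This is a minimal illustration of Equations (2) and (3) with a matrix-based dialogue act penalty, not the actual ARTIC implementation; the function and feature names are hypothetical, and the weights are only examples taken from Table 3 and Section 6.3.5.

```python
from typing import Dict, Optional

# Hypothetical feature weights in the spirit of Table 3, extended with the
# dialogue act feature (w_DA = 12.0, see Section 6.3.5).
WEIGHTS: Dict[str, float] = {
    "position_in_prosodic_word": 7.0,
    "left_phoneme_context": 3.0,
    "right_phoneme_context": 3.0,
    "prosodeme_type": 14.0,
    "voicing_begin": 8.5,
    "voicing_end": 8.5,
    "dialogue_act": 12.0,
}

def da_penalty(da_target: str, da_candidate: str,
               penalty_matrix: Optional[Dict[str, Dict[str, float]]] = None) -> float:
    """Dialogue act penalty: Equation (3) if no matrix is given,
    otherwise a lookup in the final penalty matrix M (Table 4)."""
    if penalty_matrix is None:
        return 0.0 if da_target == da_candidate else 1.0
    return penalty_matrix[da_target][da_candidate]

def target_cost(target: Dict[str, str], candidate: Dict[str, str],
                penalty_matrix=None) -> float:
    """Weighted sum of feature penalties, Equation (2)."""
    num, den = 0.0, 0.0
    for feature, weight in WEIGHTS.items():
        if feature == "dialogue_act":
            d = da_penalty(target[feature], candidate[feature], penalty_matrix)
        else:
            # Nominal features: a simple 0/1 mismatch penalty, for illustration.
            d = 0.0 if target[feature] == candidate[feature] else 1.0
        num += weight * d
        den += weight
    return num / den
```

With the matrix in place, a candidate carrying a "close" dialogue act (e.g. NEUTRAL when SAD-EMPATHY is requested) is penalized less than one carrying a distant act, which is exactly the behaviour motivated above.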
6.3.2 Listening test based differences

Given the annotations of the expressive recordings presented in Section 5, a penalty matrix P was created. Its coefficients p_{ij} were calculated according to the following equation:

p_{ij} = \frac{\left| \log \left( \frac{num_{ij}}{max_i} \right) \right|}{K},   (4)

where num_{ij} represents how many times recordings with dialogue act i (according to the objective annotation presented in Section 5.2) were labelled with dialogue act j (calculated over all listeners and all recordings), max_i is the maximum value of num_{ij} for fixed i, and K is a constant defined as K \geq K_{min}, where

K_{min} = \max_{\forall i,j} \left| \log \left( \frac{num_{ij}}{max_i} \right) \right|,   (5)

with the maximum taken over all i, j for which the logarithm is defined. For situations where the logarithm is not defined, p_{ij} was set to p_{ij} = K. In our experiments, K = \frac{5}{2} K_{min}. The logarithm was used to emphasize differences between the calculated ratios, and we also assumed that human perception is logarithmic-based (as suggested e.g. by the Weber–Fechner law).

6.3.3 Acoustic analysis based differences

An extensive acoustic analysis of the expressive corpus was performed in [53]. On the basis of this analysis, a penalty matrix A was created. Its coefficients a_{ij} were calculated as the Euclidean distance between numeric vectors representing the dialogue acts i and j in a 12-dimensional space. The components of the vector consist of normalized values of 4 statistical characteristics (mean value, standard deviation, skewness, kurtosis) for 3 acoustic parameters (F0 value, RMS energy and unit duration). The acoustic analysis proved that these features can be used as acoustic distance measures for this purpose. It is likely that there are other features, not considered in this work, which may affect the measure and whose influence should be explored in the future.

6.3.4 Final penalty matrix

The final penalty matrix containing numeric differences between various dialogue acts is an appropriate combination of the two separate penalty matrices (matrix P based on the annotations and matrix A based on the acoustic analysis). The coefficients m_{ij} of matrix M are calculated as follows:

m_{ij} = \frac{w_p \, p_{ij} + w_a \, a_{ij}}{w_p + w_a},   (6)

where p_{ij} and a_{ij} represent coefficients from matrices P and A, and w_p and w_a are the corresponding weights. After several experiments, the values w_p = 3 and w_a = 1 were used as the weights. Using this setting, the best results were achieved when subjectively comparing the resulting synthetic speech. We also believe that the perceptual part should be emphasized. The final penalty matrix is shown in Table 4.

6.3.5 Weight tuning for the dialogue act feature

Proper setting of a weight for any of the features is not an easy task. Some techniques for automatic settings have also been developed [54, 55]. However, in our system the settings shown in Table 3 are used, as they have proved appropriate in applications of our TTS. To set the weight for the dialogue act feature, sets of synthetic utterances were generated for various settings. Using a subjective evaluation (a brief listening test) and considering the weights of the other features, the final weight was defined as w_DA = 12.0. When compared with Table 3, this weight is one of the highest.

7 HMM algorithm modification/training

Along with the concatenative unit selection method, statistical parametric speech synthesis based on hidden Markov models (abbreviated as HMM-based speech synthesis) is one of the most researched synthesis methods [28].
Several experiments on using this synthesis method for generating expressive speech are described in [14]. In the HMM approach, statistical models (an extended type of HMMs) are trained from a natural speech database. Spectral parameters, fundamental frequency and possibly some excitation parameters are modelled simultaneously by the corresponding multi-stream HMMs.

The variability of speech is modelled by using models with a large context description, i.e. individual models are defined for various phonetic, prosodic and linguistic contexts, which are described by so-called contextual factors. The contextual factors employed in our experiments are listed in Table 5. For more details, see e.g. [56]. To increase the robustness of the estimated model parameters, models of acoustically similar units are clustered by a decision-tree-based context-clustering algorithm. As a result, similar units share one common model.

Within HMM-based speech synthesis, various methods for modelling expressivity or speaking styles have been introduced. The simplest one uses so-called style dependent models [59], i.e. an independent set of HMMs is trained for each expression. An obvious drawback of this approach is the large amount of training data required for particular expressions. A better solution is so-called style mixed models [59], where one set of HMMs is trained for all expressions together and particular expressions are distinguished by introducing an additional contextual factor. Then, models of units that are acoustically similar for several expressions are clustered. Independent models are trained only when there is a significant difference between particular expressions. Another option for modelling expressions is methods based on model adaptation [60, 61]; they are usually preferred because they allow the speech style or expression to be controlled more precisely and require less training data. However, the style mixed model utilizing an additional contextual factor for the dialogue act was used in this work.

8 Evaluation & results

This section deals with an evaluation of the procedure described in this paper to verify that it fulfils the goals which were specified at the beginning. In particular, it should be verified that listeners perceive the synthetic speech produced by the developed system as expressive (Section 8.2.1) and also how the quality of the synthetic speech changed in comparison with the baseline system (Section 8.2.2). Since the proposed TTS system is aimed at usage in a specific dialogue system, the suitability of the expressive speech synthesis in such a dialogue system is also evaluated (Section 8.4).
Column order (left to right): APOLOGY, CONFIRM, DIRECTIVE, DISCONFIRM, ENCOURAGE, GOODBYE, GREETING, HAPPY-EMPATHY, NOT-SPECIFIED, OTHER, REQUEST, SAD-EMPATHY, SHOW-INTEREST, SURPRISE, THANKS, WAIT, NEUTRAL.

APOLOGY | 0.00 0.58 0.39 0.40 0.83 0.25 0.90 0.48 0.36 0.50 0.84 0.17 0.45 0.48 0.83 0.47 0.78
CONFIRM | 0.96 0.00 0.71 0.58 0.71 0.95 0.92 0.40 0.48 0.72 0.93 0.41 0.50 0.53 0.74 0.70 0.72
DIRECTIVE | 0.90 0.56 0.00 0.58 0.26 0.86 0.46 0.50 0.36 0.65 0.33 0.49 0.38 0.81 0.92 0.59 0.41
DISCONFIRM | 0.34 0.33 0.84 0.00 0.43 0.87 0.86 0.28 0.23 0.40 0.44 0.28 0.42 0.42 0.90 0.45 0.54
ENCOURAGE | 0.83 0.50 0.48 0.64 0.00 0.83 0.60 0.35 0.31 0.43 0.27 0.39 0.14 0.27 0.75 0.52 0.55
GOODBYE | 0.30 0.53 0.34 0.87 0.47 0.00 0.59 0.19 0.17 0.50 0.82 0.25 0.42 0.82 0.25 0.55 0.61
GREETING | 0.90 0.63 0.86 0.86 0.58 0.90 0.00 0.52 0.31 0.68 0.89 0.90 0.53 0.87 0.93 0.54 0.42
HAPPY-EMPATHY | 0.44 0.25 0.58 0.41 0.24 0.39 0.89 0.00 0.21 0.45 0.54 0.29 0.27 0.28 0.46 0.54 0.58
NOT-SPECIFIED | 0.39 0.26 0.25 0.35 0.12 0.19 0.26 0.15 0.00 0.34 0.20 0.22 0.13 0.18 0.32 0.38 0.46
OTHER | 0.95 1.00 0.29 0.97 0.94 0.36 0.97 0.96 0.30 0.00 0.94 0.99 0.94 0.95 0.92 0.89 0.84
REQUEST | 0.84 0.93 0.30 0.88 0.17 0.82 0.35 0.45 0.31 0.40 0.00 0.51 0.28 0.80 0.87 0.58 0.59
SAD-EMPATHY | 0.28 0.28 0.37 0.41 0.26 0.32 0.90 0.33 0.28 0.50 0.49 0.00 0.25 0.33 0.89 0.59 0.67
SHOW-INTEREST | 0.88 0.53 0.39 0.62 0.15 0.85 0.87 0.43 0.27 0.58 0.32 0.41 0.00 0.34 0.90 0.57 0.45
SURPRISE | 0.86 0.29 0.47 0.40 0.05 0.82 0.87 0.15 0.14 0.37 0.35 0.24 0.06 0.00 0.59 0.51 0.51
THANKS | 0.83 0.54 0.92 0.90 0.53 0.46 0.93 0.88 0.91 0.92 0.87 0.89 0.90 0.89 0.00 0.88 0.86
WAIT | 0.47 0.58 0.24 0.89 0.32 0.93 0.88 0.55 0.56 0.89 0.29 0.57 0.39 0.90 0.88 0.00 0.63
NEUTRAL | 0.78 0.72 0.41 0.54 0.55 0.61 0.42 0.58 0.46 0.84 0.59 0.67 0.45 0.51 0.86 0.63 0.00

Table 4: Final penalty matrix M.

Contextual factor | Possible values
Left and right phonetic context | Czech phonetic alphabet [57]
Phone position in prosodic word | 1, 2, 3, 4, 5, ... (forward and backward)
Prosodic word position in clause | (forward and backward)
Prosodeme | terminating satisfactorily, terminating unsatisfactorily, non-terminating, null
Dialogue act | see Section 4

Table 5: A list of contextual factors and their values. Prosodic words, clauses and prosodemes are thoroughly described in [58].

During the design of the expressive TTS system, it turned out that some of the dialogue acts (further referred to as DAs) appear much more frequently than others, and some of them are very rare. Thus, only the most frequent DAs were used to evaluate the system, and they were divided into two separate groups:

Expressive dialogue acts:
– SHOW-INTEREST – relative frequency 34.9%;
– ENCOURAGE – relative frequency 29.4%;
– CONFIRM – relative frequency 13.2%;
– HAPPY-EMPATHY – relative frequency 8.6%;
– SAD-EMPATHY – added because it is considered to be an opposite to the HAPPY-EMPATHY dialogue act; relative frequency 3.4%.

Neutral dialogue acts:
– NOT-SPECIFIED – besides being one of the most frequently occurring DAs, it should also represent neutral synthetic speech; relative frequency 7.4%;
– NEUTRAL – this is not a DA per se; it is defined here to represent the neutral speech produced by the current baseline TTS system for the purposes of the evaluation.
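Returning briefly to Section 6.3: the final penalty matrix shown in Table 4 above combines the perceptual matrix P (Equations 4 and 5) with the acoustic matrix A through Equation (6). The following is a minimal sketch of that construction under stated assumptions; the act list, data structures and function names are purely illustrative and not the code used for the paper.

```python
import math
import numpy as np

ACTS = ["APOLOGY", "CONFIRM", "NEUTRAL"]  # illustrative subset of the DA set

def perceptual_matrix(counts, K):
    """Matrix P, Equation (4): counts[i][j] = how many times recordings
    objectively annotated as DA i were marked by listeners as DA j."""
    P = np.zeros((len(ACTS), len(ACTS)))
    for i, da_i in enumerate(ACTS):
        max_i = max(counts[da_i].values())
        for j, da_j in enumerate(ACTS):
            n_ij = counts[da_i].get(da_j, 0)
            # n_ij == 0: the logarithm is undefined; Section 6.3.2 assigns the
            # maximum penalty in that case (here taken as 1.0 after scaling by K).
            P[i, j] = abs(math.log(n_ij / max_i)) / K if n_ij > 0 else 1.0
    return P

def acoustic_matrix(stats):
    """Matrix A, Section 6.3.3: Euclidean distance between the 12-dimensional
    vectors of normalized F0/energy/duration statistics of each DA."""
    A = np.zeros((len(ACTS), len(ACTS)))
    for i, da_i in enumerate(ACTS):
        for j, da_j in enumerate(ACTS):
            A[i, j] = np.linalg.norm(np.asarray(stats[da_i]) - np.asarray(stats[da_j]))
    return A

def final_matrix(P, A, w_p=3.0, w_a=1.0):
    """Matrix M, Equation (6): weighted combination of P and A."""
    return (w_p * P + w_a * A) / (w_p + w_a)
```

With w_p = 3 and w_a = 1, final_matrix(perceptual_matrix(counts, K), acoustic_matrix(stats)) mirrors the weighting chosen in Section 6.3.4, which deliberately emphasizes the perceptual part.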
All the listening tests described further were performed using the same system as it was used for the expressive cor- pus annotation (described in Section 5.1). Of course, the questions and options were different within this evaluation but the core of the system is the same. The majority of lis- tening tests participants were experts in speech or language processing, some of them were university students. Texts of synthesized utterances were not a part of the corpora, new texts were created for this purpose. The content of the texts corresponds to the dialogue act to be synthesized (for expressive synthesis), or it is neutral (for neutral synthesis). 8.1 Expressivity perception in natural speech Before assessing the synthetic expressive speech, a lis- tening test focused on expressivity perception in natural speech was performed. This gives us a brief overview of how the listeners are able to perceive the expressivity and later a comparison between expressivity perception in nat- ural and synthetic speech can be presented. All the listeners were assessing randomly selected ut- terances form the natural corpora (neutral and expressive) and their task was to mark if they perceive any kind of ex- pressivity or not or if they are not able to make a decision. 14 listeners participated in this test, each listener was pre- sented with 34 utterances – 4 for each expressive dialogue act being evaluated and 7 for each dialogue act considered as neutral (i.e. NOT-SPECIFIED and NEUTRAL). The re- sults are depicted in Figure 6 and also shown in Table 6. Figure 6: Expressivity perception in natural speech. dialogue act expressivity cannot perception ratio decide CONFIRM 38% 3% ENCOURAGE 61% 7% HAPPY-EMPATHY 77% 4% SAD-EMPATHY 73% 6% SHOW-INTEREST 18% 11% mean 53% 6% NOT-SPECIFIED 42% 13% NEUTRAL 36% 3% mean 39% 7% Table 6: Expressivity perception in natural speech. The results are quite surprising, especially for neutral speech. In 39 % of neutral natural utterances (in average, includingNOT-SPECIFIED), the listeners perceived an ex- pressivity. It seems that some kind of expressivity is in- cluded even in the neutral corpus and the listeners are very 158 Informatica 44 (2020) 147–165 M. Gr˚ uber et al. sensitive to that, and they are able to perceive it. This fact can be related to the content of speech since as it was de- scribed in [62], the content as such might also influence the listeners’ expressivity perception. The results for the expressive DAs depends on a par- ticular DA. For instance, utterances marked as HAPPY- EMPATHY and SAD-EMPATHY are mostly recognized as expressive whereas utterances marked as SHOW- INTEREST are not. These results give us a baseline for the evaluation of ex- pressive synthetic speech. Since for some DAs the listeners don’t perceive expressivity even in natural speech, it’s un- likely that they will perceive it in synthetic speech. 8.2 Evaluation of the unit selection based expressive speech synthesis During the evaluation of expressive synthetic speech, two main factors were investigated – expressivity perception and speech quality. It’s supposed that the quality of syn- thetic speech will be affected by the expressivity integra- tion as the expressive speech is much more dynamic and thus more artificial artifacts may occur. This section deals with the evaluation of expressive synthetic speech pro- duced by the unit selection TTS system. The evaluation of HMM-based TTS system is presented in section 8.3. 
In the listening tests regarding the evaluation of expressive synthetic speech, 13 listeners assessed 30 utterances – 4 for each DA in question and 2 for natural neutral speech (so that a comparison of speech quality could be performed).

8.2.1 Expressivity perception in synthetic speech

The results for expressivity perception in synthetic expressive speech are depicted in Figure 7 and presented in Table 7.

Figure 7: Expressivity perception in synthetic speech (unit selection).

dialogue act | expressivity perception ratio | cannot decide
CONFIRM | 69% | 4%
ENCOURAGE | 42% | 8%
HAPPY-EMPATHY | 50% | 10%
SAD-EMPATHY | 63% | 4%
SHOW-INTEREST | 46% | 4%
mean | 54% | 6%
NOT-SPECIFIED | 10% | 0%
NEUTRAL | 15% | 0%
mean | 13% | 0%
natural speech (neutral) | 42% | 4%

Table 7: Expressivity perception in synthetic speech (unit selection).

Again, a surprising result can be observed for natural neutral speech, as expressivity was perceived at a quite high ratio (42%). However, this is consistent with the previous results presented in Table 6 (39%).

For synthetic speech generated as NOT-SPECIFIED and for baseline neutral synthetic speech (marked as NEUTRAL), almost no expressivity was perceived. On the other hand, for expressive DAs, the expressivity perception ratio was quite high (mean value 54%) and even slightly higher than for expressive natural speech (mean value 53%, see Table 6).

To verify that the achieved results are not random, a statistical measure of listener agreement (Fleiss' kappa was used here) was calculated. Its value varies in the range ⟨−1, 1⟩, and a positive value indicates an agreement above the chance level. In our experiment, the Fleiss' kappa was calculated as F = 0.37, which means a moderate agreement.

In addition, other measures might be used to verify the results; for instance the precision, recall, F1 and accuracy measures, which are mostly used for the evaluation of classifiers in classification tasks. The presented listening test can indeed be viewed as a classification task in which the listeners, acting as classifiers, classify utterances into two distinct classes: expressivity perceived and not perceived (the "cannot decide" answers were not considered in this verification). The measures are determined as follows:

P = \frac{t_p}{p_p}, \quad R = \frac{t_p}{a_p}, \quad F1 = \frac{2 \, P \, R}{P + R}, \quad A = \frac{t_p + t_n}{a_p + a_n},

where P is precision, the ability of a listener not to perceive a neutral sentence as expressive; R is recall (also sensitivity), the ability of a listener not to perceive expressive sentences as neutral; A is accuracy, the ability of a listener to perceive expressivity in expressive sentences and not to perceive it in neutral sentences; F1 is the harmonic mean of precision and recall; t_p means "true positives" (i.e. the number of expressive sentences correctly perceived as expressive); t_n means "true negatives" (i.e. the number of neutral sentences correctly perceived as neutral); p_p stands for "predicted positives" (i.e. the number of all sentences perceived as expressive); a_p stands for "actual positives" (i.e. the number of all actual expressive sentences); and a_n means "actual negatives" (i.e. the number of all actual neutral sentences).

The calculated values of these measures are presented in Table 8, together with the values that would be achieved if the expressivity perception were assessed completely at random.

measure | real listeners | random assessment
precision | 0.92 | 0.72
recall | 0.58 | 0.50
F1 measure | 0.71 | 0.59
accuracy | 0.66 | 0.50

Table 8: Statistical measures for the expressivity perception listening test and comparison with a completely random assessment.
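A minimal sketch of how the figures in Table 8 can be obtained from the raw listener answers; the input format (a list of (is_expressive, perceived_expressive) pairs with the "cannot decide" answers already removed) is an assumption made for illustration only.

```python
def verification_measures(answers):
    """answers: list of (is_expressive, perceived_expressive) boolean pairs,
    one per (listener, utterance) combination, 'cannot decide' answers excluded."""
    tp = sum(1 for actual, perceived in answers if actual and perceived)
    tn = sum(1 for actual, perceived in answers if not actual and not perceived)
    pp = sum(1 for _, perceived in answers if perceived)   # predicted positives
    ap = sum(1 for actual, _ in answers if actual)         # actual positives
    an = len(answers) - ap                                 # actual negatives
    precision = tp / pp
    recall = tp / ap
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (ap + an)
    return precision, recall, f1, accuracy
```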
As the verification indicates, the expressivity perception ratio in synthetic speech is not the result of a random process. It is necessary to note that there are two main factors which affect the expressivity perception. The first one is the TTS system and the synthetic speech whose evaluation is the main goal. The second is the listeners – each of them might perceive (assess) various intensities of various expressivity categories differently. However, the main task here is not to evaluate the listeners and whether they are or are not able to perceive expressivity (which is basically impossible). The listeners are simply trusted, and the only thing that can be done is to calculate some kind of agreement measure.

In synthetic expressive speech generated with a particular DA in mind, the relative ratio between units originally coming from utterances labelled with this DA and units coming from other utterances can be measured. The ratio might vary depending on the setting of the weight for the dialogue act feature. The calculated ratios for the current weight settings (as designed in Section 6.3.5) are shown in Figure 8.

It is worth noting that this measure is very low for the NOT-SPECIFIED DA. However, after further investigation, it turned out that when synthesizing utterances for this DA, units coming from the neutral corpus (NEUTRAL) were mostly selected. This supports the assumption that the NOT-SPECIFIED DA represents neutral speech (although in the final penalty matrix M the distance between NOT-SPECIFIED and NEUTRAL was calculated as 0.46, which is quite high). It also seems that there is no strong relation between this measure and the expressivity perception results presented in Table 7.

Figure 8: Relative ratio of units coming from utterances labelled with the DA which was intended to be synthesized.

8.2.2 Quality evaluation

To investigate whether the synthetic speech quality deteriorated by adding the expressivity, a MOS test evaluation was performed. In the MOS test, the listeners assess the speech quality using a 5-point scale where, in theory, natural speech should be evaluated as 5 (100%) and very unnatural speech as 1 (0%). The test was run along with the expressivity perception test, i.e. the test conditions, test utterances and listeners were the same as for the evaluation presented in Section 8.2.1. The results of this MOS test are shown in Figure 9 and also in Table 9, together with a relative comparison with natural speech (whose result is evaluated as 100%).

Figure 9: Evaluation of speech quality using a MOS test (unit selection).

dialogue act | MOS score (mean) | MOS score (std) | comparison with natural speech
CONFIRM | 3.87 | 1.11 | 79%
ENCOURAGE | 3.48 | 0.97 | 68%
HAPPY-EMPATHY | 3.10 | 1.00 | 58%
SAD-EMPATHY | 3.87 | 0.94 | 79%
SHOW-INTEREST | 3.25 | 0.92 | 62%
mean | 3.51 | 0.99 | 69%
NOT-SPECIFIED | 3.92 | 0.78 | 81%
NEUTRAL | 4.08 | 0.78 | 83%
mean | 4.00 | 0.78 | 82%
natural speech | 4.65 | 0.48 | 100%

Table 9: Evaluation of speech quality using a MOS test (unit selection).

The results suggest that the quality of expressive synthetic speech is worse than the quality of neutral synthetic speech by 0.49 of the MOS score (13%) on average. This is almost the same difference as between natural speech and neutral synthetic speech (0.65 of the MOS score). The deterioration is probably caused by the greater variability of the acoustic signal of expressive speech. Thus, artifacts might occur more often than in neutral synthetic speech.

An auxiliary measure called smooth joints can also be calculated. A smooth joint is a concatenation point of two speech units that were originally adjacent in the speech corpus, so that their concatenation is natural.
The smooth joints measure indicates the relative ratio of such joints with respect to the number of all concatenation points. The calculated values are presented in Figure 10. It is assumed that the fewer smooth joints in synthetic speech, the more artifacts can occur, causing the synthetic speech quality to be worse.

Figure 10: Relative ratio of smooth joints.

It is obvious that the relative ratio of smooth joints is almost the same regardless of the DA (mean 79%) and also in comparison with neutral synthetic speech (mean 82%). Also, this measure seems to be unrelated to the expressivity perception measure or the MOS score.

8.3 Evaluation of the HMM-based expressive speech synthesis

Even though this work deals mostly with unit selection speech synthesis, the results of an experiment with HMM-based expressive speech synthesis are briefly discussed in this section. The method used is based on the HTS system [63] and adapted to the Czech language [56]. The experiment is described in more detail in [62], and the HMM approach is briefly presented in Section 7. The aim is to evaluate the capability of the HMM-based TTS system to produce expressive speech (shown in Table 10) and to evaluate its quality (Table 11). The presented results are summarized and various DAs are not differentiated. There were 12 listeners participating in these listening tests.

dialogue act | expressivity perception ratio | cannot decide
expressive | 15% | 5%
NOT-SPECIFIED | 8% | 3%

Table 10: Expressivity perception in synthetic speech (HMM).

dialogue act | MOS score (mean) | comparison with natural speech
expressive DAs + NOT-SPECIFIED | 2.71 | 50%
natural speech | 4.44 | 100%

Table 11: Evaluation of speech quality using a MOS test (HMM).

The expressivity perception ratio in synthetic speech produced by the HMM-based expressive TTS system is at a very low level (15%) in comparison with the unit selection TTS system (54%). The quality of the synthetic speech is also much worse: 2.7 of the MOS score (50% of natural speech) for the HMM-based system versus 3.5 (69%) for the unit selection system. Generally, HMM-based speech synthesis for the Czech language is not yet at such a high level as the unit selection approach. Moreover, by adding expressive speech into this process, the trained HMM models may in fact mix natural and expressive acoustic signal depending on how the decision trees were created. Thus, in such synthetic speech of a lower quality, it is probably hard to identify any kind of expressivity.

8.4 Evaluation of the expressivity in dialogues

Since the unit selection expressive speech synthesis is going to be used in a specific dialogue system (conversations between seniors and a computer; see Sections 1 and 2), it is necessary to evaluate it also with respect to this purpose. A preference listening test was used to perform this kind of evaluation. The test stimuli were prepared as follows:
8.4 Evaluation of the expressivity in dialogues

Since the unit selection expressive speech synthesis is going to be used in a specific dialogue system (conversations between seniors and a computer; see Sections 1 and 2), it is necessary to evaluate it also with respect to this purpose. A preference listening test was used to perform this kind of evaluation. The test stimuli were prepared as follows:

– 6 appropriate parts of the natural dialogues (see Section 2), each approximately 1 minute in length, were randomly selected. The appropriateness was determined on the basis of the sufficiency of the avatar’s interactions within the dialogues. These parts will be further referred to as minidialogues.

– The acoustic signal of each minidialogue was split into parts where the person is speaking and parts where the avatar’s responses are expressed by the neutral speech synthesis.

– The text contents of the avatar’s responses were slightly modified so that the newly generated responses really had to be synthesized and not merely played back. The sense of the utterances was, of course, kept the same so that the dialogue flow was not disrupted.

– The new texts (the avatar’s responses) were synthesized using both the baseline neutral TTS system and the newly developed expressive TTS system; before the expressive speech synthesis, the texts were labelled with presumably appropriate DAs.

– In some parts of the minidialogues where the person is originally speaking, small modifications were made so that the length of the person’s speech was shortened – for instance, parts where the person was speaking for a long time or where a long silence was detected were removed. Again, the natural dialogue flow was not disrupted.

– The parts of the minidialogues were joined together so that two versions of each minidialogue were created: the first with the avatar’s responses in neutral synthetic speech and the second with the avatar’s responses in expressive synthetic speech.

Each of the 6 minidialogues contains 4 avatar responses on average, expressing various DAs, mostly SHOW-INTEREST or ENCOURAGE. However, each evaluated DA was included at least once in the responses. The minidialogues were then presented to the listeners within a listening test, with both variants of a minidialogue in a single test query. The task for the listeners was to decide which variant is more natural, more pleasant, and which one they would prefer if they were in the place of the human participant in the minidialogue. The results of this evaluation are presented in Table 12; there were 11 listeners participating in this listening test.

synthesis variant   preference
neutral                  8 %
expressive              83 %
cannot decide            9 %

Table 12: Evaluation of neutral vs. expressive speech synthesis in dialogues.

It is obvious that the listeners preferred the expressive speech synthesis to the neutral one (83 % preference ratio). This is one of the most important results, indicating that the developed system improves the user experience with the TTS system for this limited domain task.

To verify that the avatar’s responses were indeed synthesized and not merely played back, the smooth joints measure can be used. The mean value of this measure for the expressive avatar responses is 86 %, which is slightly higher than the values measured in Figure 10 in Section 8.2.2 (mean 82 % for neutral speech and 79 % for expressive speech). However, it still means that the responses were really synthesized.
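The preference values in Table 12 are simple shares of the listeners’ answers over all test queries. A minimal tabulation sketch follows; the raw vote counts are illustrative values chosen to be consistent with Table 12, under the assumption that each of the 11 listeners rated all 6 minidialogues.

    from collections import Counter

    def preference_shares(votes):
        """Share of each answer over all test queries (listeners x minidialogues)."""
        counts = Counter(votes)
        total = sum(counts.values())
        return {option: counts[option] / total
                for option in ("neutral", "expressive", "cannot decide")}

    # Illustrative raw votes (11 listeners x 6 minidialogues = 66 answers),
    # chosen so that the shares match Table 12: 8 %, 83 % and 9 %.
    votes = ["neutral"] * 5 + ["expressive"] * 55 + ["cannot decide"] * 6
    for option, share in preference_shares(votes).items():
        print(f"{option:>13}: {share:.0%}")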
9 Conclusion

It is necessary to incorporate some kind of expressivity into synthetic speech, as it improves the user experience with systems using speech synthesis technology. Expressive speech sounds more natural in dialogues between humans and computers. There are several ways to make synthetic speech sound expressive. In this work, expressivity described by dialogue acts was employed, and the algorithms of the TTS system were modified to use this information when producing synthetic speech.

The results presented in Section 8 suggest that the listeners perceived some kind of expressivity in the speech produced by the expressive TTS system. More importantly, it was also confirmed that in dialogues within the discussed limited domain, expressive speech is more suitable and is preferred over the purely neutral speech produced by the baseline TTS system, even though its quality is slightly worse.

Although the development of the expressive TTS system was done within a limited domain of conversations about personal photographs between humans and a computer, the whole procedure – data collection, data annotation, expressive corpus preparation and recording, expressivity description and TTS system modification – can be used within any other limited domain if an appropriate expressivity definition is used. Thus, expressivity can be incorporated into any other dialogue system with a similar structure.

Acknowledgement

This research was supported by the Czech Science Foundation (GA CR), project No. GA19-19324S, and by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506.

References

[1] J. D. Williams, S. Young, Partially observable Markov decision processes for spoken dialog systems, Computer Speech and Language 21 (2) (2007) 393–422. https://doi.org/10.1016/j.csl.2006.06.008

[2] O. Lemon, K. Georgila, J. Henderson, M. Stuttle, An ISU dialogue system exhibiting reinforcement learning of dialogue policies: generic slot-filling in the TALK in-car system, in: Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations, EACL ’06, Association for Computational Linguistics, Stroudsburg, PA, USA, 2006, pp. 119–122. https://doi.org/10.3115/1608974.1608986

[3] X. Wu, M. Xu, W. Wu, Preparing for evaluation of a flight spoken dialogue system, in: Proceedings of ISCSLP, 2002, paper 50.

[4] J. Švec, L. Šmídl, Prototype of Czech spoken dialog system with mixed initiative for railway information service, in: P. Sojka, A. Horák, I. Kopecek, K. Pala (Eds.), Text, Speech and Dialogue, Vol. 6231 of Lecture Notes in Computer Science, Springer, Berlin-Heidelberg, Germany, 2010, pp. 568–575. https://doi.org/10.1007/978-3-642-15760-8_72

[5] A. Meštrović, L. Bernić, M. Pobar, S. Martinčić-Ipšić, I. Ipšić, Overview of a Croatian weather domain spoken dialog system prototype, in: 32nd International Conference on Information Technology Interfaces (ITI), Cavtat, Dubrovnik, 2010, pp. 103–108.

[6] A. W. Black, Unit selection and emotional speech, in: Proceedings of Eurospeech, Geneva, Switzerland, 2003, pp. 1649–1652.

[7] M. Bulut, S. S. Narayanan, A. K. Syrdal, Expressive speech synthesis using a concatenative synthesiser, in: Proceedings of the 7th International Conference on Spoken Language Processing – ICSLP, Denver, CO, USA, 2002, pp. 1265–1268.

[8] W. Hamza, R. Bakis, E. M. Eide, M. A. Picheny, J. F. Pitrelli, The IBM expressive speech synthesis system, in: Proceedings of the 8th International Conference on Spoken Language Processing – ICSLP, Jeju, Korea, 2004, pp. 2577–2580. https://doi.org/10.1109/tasl.2006.876123

[9] I. Steiner, M. Schröder, M. Charfuelan, A. Klepp, Symbolic vs. acoustics-based style control for expressive unit selection, in: Seventh ISCA Tutorial and Research Workshop on Speech Synthesis, Kyoto, Japan, 2010, pp. 114–119.

[10] J. Lorenzo-Trueba, G. E. Henter, S. Takaki, J. Yamagishi, Y. Morino, Y. Ochiai, Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis, Speech Communication 99 (2018) 135–143. https://doi.org/10.1016/j.specom.2018.03.002
[11] S. An, Z. Ling, L. Dai, Emotional statistical parametric speech synthesis using LSTM-RNNs, in: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, 2017, pp. 1613–1616. https://doi.org/10.1109/apsipa.2017.8282282

[12] H. Li, Y. Kang, Z. Wang, EMPHASIS: An emotional phoneme-based acoustic model for speech synthesis system, in: Proceedings of Interspeech, 2018. https://doi.org/10.21437/interspeech.2018-1511

[13] S. Krstulovic, A. Hunecke, M. Schroder, An HMM-based speech synthesis system applied to German and its adaptation to a limited set of expressive football announcements, in: Proceedings of Interspeech, Antwerp, Belgium, 2007, pp. 1897–1900.

[14] B. Picart, R. Brognaux, T. Drugman, HMM-based speech synthesis of live sports commentaries: Integration of a two-layer prosody annotation, in: 8th ISCA Speech Synthesis Workshop, Barcelona, Spain, 2013.

[15] H. Yang, H. Meng, L. Cai, Modeling the acoustic correlates of dialog act for expressive Chinese TTS synthesis, IET Conference Publications 2008 (CP544) (2008) 49–53. https://doi.org/10.1049/cp:20080758

[16] P. Ircing, J. Romportl, Z. Loose, Audiovisual interface for Czech spoken dialogue system, in: IEEE 10th International Conference on Signal Processing Proceedings, Institute of Electrical and Electronics Engineers, Inc., Beijing, China, 2010, pp. 526–529. https://doi.org/10.1109/icosp.2010.5656088

[17] J. F. Kelley, An iterative design methodology for user-friendly natural language office information applications, ACM Transactions on Information Systems 2 (1) (1984) 26–41. https://doi.org/10.1145/357417.357420

[18] S. Whittaker, M. Walker, J. Moore, Fish or fowl: A Wizard of Oz evaluation of dialogue strategies in the restaurant domain, in: Language Resources and Evaluation Conference, Gran Canaria, Spain, 2002.

[19] M. Hajdinjak, F. Mihelič, The Wizard of Oz system for weather information retrieval, in: V. Matoušek, P. Mautner (Eds.), Text, Speech and Dialogue, proceedings of the 6th International Conference TSD, Vol. 2807 of Lecture Notes in Computer Science, Springer, Berlin-Heidelberg, Germany, 2003, pp. 400–405. https://doi.org/10.1007/978-3-540-39398-6_57

[20] J. A. Russell, A circumplex model of affect, Journal of Personality and Social Psychology 39 (1980) 1161–1178.

[21] A. Mehrabian, Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament, Current Psychology 14 (1996) 261–292. https://doi.org/10.1007/BF02686918

[22] R. R. Cornelius, The science of emotion: Research and tradition in the psychology of emotions, Prentice-Hall, Englewood Cliffs, NJ, USA, 1996.

[23] A. K. Syrdal, A. Conkie, Y.-J. Kim, M. Beutnagel, Speech acts and dialog TTS, in: Proceedings of the 7th ISCA Speech Synthesis Workshop – SSW7, Kyoto, Japan, 2010, pp. 179–183.

[24] E. Zovato, A. Pacchiotti, S. Quazza, S. Sandri, Towards emotional speech synthesis: A rule based approach, in: Proceedings of the 5th ISCA Speech Synthesis Workshop – SSW5, Pittsburgh, PA, USA, 2004, pp. 219–220.

[25] J. M. Montero, J. Gutiérrez-Ariola, S. Palazuelos, E. Enríquez, S. Aguilera, J. M. Pardo, Emotional speech synthesis: From speech database to TTS, in: Proceedings of the 5th International Conference on Spoken Language Processing – ICSLP, Vol. 3, Sydney, Australia, 1998, pp. 923–926.
[26] J. F. Pitrelli, R. Bakis, E. M. Eide, R. Fernandez, W. Hamza, M. A. Picheny, The IBM expressive text-to-speech synthesis system for American English, IEEE Transactions on Audio, Speech, and Language Processing 14 (4) (2006) 1099–1108. https://doi.org/10.1109/tasl.2006.876123

[27] A. J. Hunt, A. W. Black, Unit selection in a concatenative speech synthesis system using a large speech database, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, 1996, pp. 373–376. https://doi.org/10.1109/ICASSP.1996.541110

[28] H. Zen, K. Tokuda, A. W. Black, Statistical parametric speech synthesis, Speech Communication 51 (2009) 1039–1064. https://doi.org/10.1016/j.specom.2009.04.004

[29] H. Zen, A. Senior, M. Schuster, Statistical parametric speech synthesis using deep neural networks, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 7962–7966. https://doi.org/10.1109/ICASSP.2013.6639215

[30] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: A generative model for raw audio, in: Arxiv, 2016. arXiv:1609.03499v2.

[31] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., Tacotron: Towards end-to-end speech synthesis, arXiv preprint arXiv:1703.10135. https://doi.org/10.21437/interspeech.2017-1452

[32] A. Kain, M. W. Macon, Spectral voice conversion for text-to-speech synthesis, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, 1998, pp. 285–288. https://doi.org/10.1109/icassp.1998.674423

[33] H. Kawanami, Y. Iwami, T. Toda, H. Saruwatari, K. Shikano, GMM-based voice conversion applied to emotional speech synthesis, IEEE Transactions on Speech and Audio Processing 7 (1999) 2401–2404.

[34] J. Parker, Y. Stylianou, R. Cipolla, Adaptation of an expressive single speaker deep neural network speech synthesis system, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5309–5313. https://doi.org/10.1109/ICASSP.2018.8461888

[35] J. Matoušek, D. Tihelka, J. Romportl, Current state of Czech text-to-speech system ARTIC, in: Text, Speech and Dialogue, proceedings of the 9th International Conference TSD, Vol. 4188 of Lecture Notes in Computer Science, Springer, Berlin-Heidelberg, Germany, 2006, pp. 439–446. https://doi.org/10.1007/11846406_55

[36] D. Tihelka, J. Kala, J. Matoušek, Enhancements of Viterbi search for fast unit selection synthesis, in: Proceedings of Interspeech, Makuhari, Japan, 2010, pp. 174–177.

[37] M. Grůber, M. Legát, P. Ircing, J. Romportl, J. Psutka, Czech Senior COMPANION: Wizard of Oz data collection and expressive speech corpus recording and annotation, in: Z. Vetulani (Ed.), Human Language Technology. Challenges for Computer Science and Linguistics, Vol. 6562 of Lecture Notes in Computer Science, Springer, Berlin-Heidelberg, Germany, 2011, pp. 280–290. https://doi.org/10.1007/978-3-642-20095-3_26

[38] R. Cowie, Describing the emotional states expressed in speech, in: ISCA Workshop on Speech and Emotion, Newcastle, UK, 2000, pp. 11–18.

[39] A. K. Syrdal, Y.-J. Kim, Dialog speech acts and prosody: Considerations for TTS, in: Proceedings of Speech Prosody, Campinas, Brazil, 2008, pp. 661–665.
[40] M. G. Core, J. F. Allen, Coding dialogs with the DAMSL annotation scheme, in: Working Notes of the AAAI Fall Symposium on Communicative Action in Humans and Machines, Cambridge, MA, USA, 1997, pp. 28–35.

[41] J. Allen, M. Core, Draft of DAMSL: Dialog act markup in several layers, WWW page, [online] (1997).

[42] D. Jurafsky, E. Shriberg, D. Biasca, Switchboard-DAMSL labeling project coder’s manual, Tech. Rep. 97–02, University of Colorado, Institute of Cognitive Science, Boulder, Colorado, USA (1997).

[43] S. Jekat, A. Klein, E. Maier, I. Maleck, M. Mast, J. J. Quantz, Dialogue acts in VERBMOBIL, Tech. rep., German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany (1995).

[44] J. Alexandersson, B. Buschbeck-Wolf, T. Fujinami, M. Kipp, S. Koch, E. Maier, N. Reithinger, B. Schmitz, M. Siegel, Dialogue acts in VERBMOBIL-2 – second edition, Tech. rep., German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany (1998).

[45] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. Ser. B 39 (1) (1977) 1–38, with discussion.

[46] J. Romportl, Prosodic phrases and semantic accents in speech corpus for Czech TTS synthesis, in: Text, Speech and Dialogue, proceedings of the 11th International Conference TSD, Vol. 5246 of Lecture Notes in Artificial Intelligence, Springer, Berlin-Heidelberg, Germany, 2008, pp. 493–500. https://doi.org/10.1007/978-3-540-87391-4_63

[47] J. L. Fleiss, Measuring nominal scale agreement among many raters, Psychological Bulletin 76 (5) (1971) 378–382. https://doi.org/10.1037/h0031619

[48] J. L. Fleiss, J. Cohen, The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability, Educational and Psychological Measurement 33 (3) (1973) 613–619. https://doi.org/10.1177/001316447303300309

[49] J. A. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1) (1960) 37–46. https://doi.org/10.1177/001316446002000104

[50] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics 33 (1) (1977) 159–174. https://doi.org/10.2307/2529310

[51] D. Tihelka, J. Matoušek, Unit selection and its relation to symbolic prosody: a new approach, INTERSPEECH 2006 – ICSLP, proceedings of the 9th International Conference on Spoken Language Processing 1 (2006) 2042–2045.

[52] M. Grůber, Enumerating differences between various communicative functions for purposes of Czech expressive speech synthesis in limited domain, in: Proceedings of Interspeech, Portland, Oregon, USA, 2012, pp. 650–653.

[53] M. Grůber, Acoustic analysis of Czech expressive recordings from a single speaker in terms of various communicative functions, in: Proceedings of the 11th IEEE International Symposium on Signal Processing and Information Technology, IEEE, New York, NY, USA, 2011, pp. 267–272. https://doi.org/10.1109/isspit.2011.6151576

[54] L. Latacz, W. Mattheyses, W. Verhelst, Joint target and join cost weight training for unit selection synthesis, in: Proceedings of Interspeech, ISCA, Florence, Italy, 2011, pp. 321–324.

[55] X. L. F. Alias, Evolutionary weight tuning for unit selection based on diphone pairs, in: Proceedings of Eurospeech, Vol. 2, Geneva, Switzerland, 2003, pp. 1333–1336.

[56] Z. Hanzlíček, Czech HMM-based speech synthesis, in: Text, Speech and Dialogue, proceedings of the 13th International Conference TSD, Vol. 6231 of Lecture Notes in Computer Science, Springer, Berlin-Heidelberg, Germany, 2010, pp. 291–298. https://doi.org/10.1007/978-3-642-15760-8_37
[57] J. Nouza, J. Psutka, J. Uhlíř, Phonetic alphabet for speech recognition of Czech, Radioengineering 6 (4) (1997) 16–20.

[58] J. Romportl, J. Matoušek, D. Tihelka, Advanced prosody modelling, in: Text, Speech and Dialogue, proceedings of the 7th International Conference TSD, Vol. 3206 of Lecture Notes in Artificial Intelligence, Springer, Berlin-Heidelberg, Germany, 2004, pp. 441–447. https://doi.org/10.1007/978-3-540-30120-2_56

[59] J. Yamagishi, K. Onishi, T. Masuko, T. Kobayashi, Modeling of various speaking styles and emotions for HMM-based speech synthesis, in: Proceedings of Eurospeech, Geneva, Switzerland, 2003, pp. 2461–2464.

[60] K. Miyanaga, T. Masuko, T. Kobayashi, A style control technique for HMM-based speech synthesis, in: Proceedings of Interspeech, 2004, pp. 1437–1440.

[61] T. Nose, Y. Kato, T. Kobayashi, A speaker adaptation technique for MRHSMM-based style control of synthetic speech, in: Proceedings of ICASSP, 2007, pp. 833–836. https://doi.org/10.1109/icassp.2007.367042

[62] M. Grůber, Z. Hanzlíček, Czech expressive speech synthesis in limited domain: Comparison of unit selection and HMM-based approaches, in: Text, Speech and Dialogue, Vol. 7499 of Lecture Notes in Computer Science, Springer, Berlin-Heidelberg, Germany, 2012, pp. 656–664. https://doi.org/10.1007/978-3-642-32790-2_80

[63] K. Tokuda, H. Zen, J. Yamagishi, T. Masuko, S. Sako, A. W. Black, The HMM-based speech synthesis system (HTS), [online].