https://doi.org/10.31449/inf.v47i7.4845 Informatica 47 (2023) 71–80

Design and Application of Neural Network-Based BP Algorithm in Speech Translation Robot

Yuhan Jie*
Department of Intelligent Sciences, Artificial Intelligence Academy, Nanchang Jiaotong Institute, Nanchang, China
E-mail: 15180180173@163.com, YuhanJie202305@hotmail.com, *lsl851219@163.com
* Corresponding author

Keywords: speech translation, robots, technology, science, medical care

Abstract
Speech translation is the process of turning spoken words in one language into spoken words in another. It entails analyzing the spoken input and producing an accurate translation in real time using cutting-edge algorithms and machine learning approaches. Speech translation technology is widely used across several sectors, including travel, business, and healthcare. For instance, a doctor who speaks English may use a voice translation system to converse with a patient who speaks another language, and a corporate executive may do the same when speaking with associates or customers abroad. Nowadays, speech translation is used in robots designed to translate speech from one language to another in real time. In this paper, we apply speech translation in the domain of intelligent science for medical care and technology. To assist English-speaking individuals in describing their symptoms to physicians or nurses who speak another language, we propose a neural network-based back-propagation (BP) technique. Unlike laptops or tablets, a humanoid robot can be extended to reach out to individuals in need first and may eventually replace human labor. Finally, the neural-based control technique that was developed proved to be an efficient system for controlling human-robot voice translation, as judged both quantitatively and qualitatively. Results from a controlled trial demonstrate the translation's accuracy and success rate.
Povzetek: A new speech translation method is described, i.e., real-time translation of speech in one language into another, for the needs of healthcare, using deep neural networks.

1 Introduction
The most common approach to voice translation systems is still to handle speech recognition and machine translation as independent modules. Although this method can cause error propagation and a drop in overall efficiency, it remains popular because it enables the individual tuning of each component, improving performance as a whole. As neural models are better able to grasp the intricate structure of language and produce more natural-sounding translations, recent developments in neural machine translation have brought considerable advances in the quality of voice translation [1]. Techniques such as regularization, sentence segmentation, smoothing, and punctuation prediction may be included in the post-processing module. To lessen the influence of recognition mistakes on the translation, it may additionally incorporate error correction. By allowing real-time processing and more effective adaptation to various accents and speaking styles, artificial intelligence technology has the potential to substantially enhance voice translation systems [2]. Speech translation robots can be a useful tool for facilitating communication across language barriers. However, the effectiveness of a particular speech translation robot will depend on its performance across these different criteria, as well as the specific needs and context of its intended users.
The structure of speech translation is illustrated in Figure 1.
Figure 1: Implementation of speech translation
The user experience could be greatly enhanced by a machine synchronous translation system that integrates voice recognition, machine translation, and cutting-edge AI methods, which would also help the online translation business grow [3]. Owing to their physical limitations, persons with disabilities face a variety of issues in their day-to-day lives, such as difficulty carrying out routine tasks that physically fit people take for granted. Robotics and other assistive technologies can be very useful in helping people with disabilities lead more independent and fulfilling lives. The creation of prosthetic arms and hands that can be operated by voice recognition or other non-intrusive techniques is one potential use of robotics in this context [4]. Such equipment enables people with limb loss to carry out a variety of activities that would be challenging or impossible without it, such as handling objects, using a keyboard, or even playing musical instruments. Overall, robots and speech processing have immense potential to enhance the independence, mobility, and capacity for social interaction of people with disabilities, thereby improving their quality of life. We can anticipate seeing more creative applications that help disabled individuals overcome their obstacles as these technologies continue to advance and become more accessible [5]. Creating voice communication capabilities for robots that seem natural and pleasant is a difficult challenge, and it calls for complex verbal and non-verbal interaction skills. To accomplish successful human-robot interaction, speech recognition/synthesis, conversation management, and motion recognition/generation are all essential elements that must be combined and enhanced [6]. The development of high-quality synthetic speech that sounds natural and intuitive is of great relevance in the context of service robots intended for voice communication. However, current Text-to-Speech (TTS) systems are often designed more for text reading than for conversation, which may result in synthetic speech that sounds repetitive, artificial, or unwelcoming [7]. Additionally, since most TTS systems are built on corpora composed mostly of monologues, synthetic speech may lack the liveliness of actual conversations. To solve these difficulties, researchers are investigating novel approaches for enhancing TTS systems for communication, such as gathering conversational corpora and creating more realistic intonation patterns. They anticipate seeing increasingly sophisticated and successful voice communication systems in service robots and other interactive devices as these technologies continue to improve [8]. As we propose, a BP approach based on neural networks can be used to create a speech translation robot. The application offers a platform that may aid in the development of spoken language, making speech translation simpler and more interactive. The remainder of the manuscript is organized as follows: Part 2 reviews prior studies related to the goals of this research and highlights their shortcomings and differences from it. Part 3 outlines the research techniques and methods used to gather and assess data, and gives suggestions for further study based on the results.
We go through the discussion and findings in Part 4, presenting the research findings succinctly and methodically and evaluating them in light of the study goals. The study's primary components are summarized in Part 5, along with its relevance and contributions, possible implications for practice or policy, and prospective future research fields.

2 Application of speech translation
The research preferred applications based on rapidly evolving voice and visual technologies. Effective examination, swift and efficient system transfer, and appropriate medication prescriptions are a few examples. The article committed to providing a thorough analysis focusing on the requirements for the standpoint addressed, which quickly became available [9]. The case study showed that deep learning and intelligence are combined in the areas of gesture detection, voice translation, emotion identification, and intelligent robot navigation. A wide variety of recognition strategies have been proposed and tested in trials in related subject areas [10]. The researchers investigated the potential for developing the method, which was put into action and evaluated on the humanoid robot Pepper. A statistical study of the ratings offered by human volunteers who identified as belonging to diverse cultures is addressed, along with the preliminary results [11]. The research observed that Text-to-Speech systems are designed for text reading rather than conversation, so robot utterances often sound repetitive, artificial, and unwelcoming; a non-monologue robot voice synthesis system is therefore provided [12]. The article reported on 24 people who participated in studies using the proposed methodology, which was tested in three situations: text reading, conversation, and domestic service robot (DSR) scenarios. The experimental findings demonstrated that, in the text-reading situation, the performance of the suggested technique was on par with the baseline method and surpassed it in the DSR scenario. The proposed system is a cloud-based voice synthesis service, making it free to use [13]. The research examined electronics for human-robot interfaces (HRI), which require thermo-acoustic sound emission technology and triboelectric acoustic sensor systems that may be employed as electrodes, thermo-acoustic sources, or triboelectric materials [14]. By meticulously adjusting the structural parameters, the GHRI achieves significant sensitivity and operational durability [15]. The research developed an artificial intelligence communication model that is also based on identified speech elements. The outcome demonstrated a promising future for the advancement of robotic intelligence [16]. The case study demonstrated that speech is a natural approach to communicating with social robots: spoken-language conversations may aid users in naturally and adaptably expressing their intentions. Speech recognition and natural language processing have made significant strides in artificial intelligence-related spoken conversation technologies over the last several years [17].
The research presented the incorporation of open-source conversation management, voice recognition, and natural language processing components into a robot software platform, as well as a report on a pilot test of the combined system with actual users [18]. The case study extracted the dialogue management component of the robot utterance and combined the spoken content with the robot's gestures, which are also crucial in human-robot interaction. Mealtime conversations about food and recipes were chosen as the discourse domain, since speaking with a companion robot in these situations is seen as natural and helpful [19]. The research drew on the second edition of Robo-Identity, which provides the chance to broaden the conversation around synthetic identity after the previous event's success; recent work has concentrated on how words and voice may be used to portray emotions. Robotic sounds that mimic expressive human voices are becoming more difficult to tell apart [20]. The case study established the possibilities and limitations of using emotional speech to communicate a human-like identity that can deceive others and raise ethical concerns. Should robots keep a machine-like posture, such as via robotic speech, or should more human-like emotional displays be considered as design possibilities [21]? The research invited viewpoints on difficulties and possibilities from a range of sectors, since they address mutually exclusive issues and the need for interdisciplinary research. Speech, emotion, and manufactured identity will be the particular focus of this year's event [22]. In the research, the companion robot is deployed after the input text is transformed directly via a deep neural network (DNN). Additionally, the article discusses the development of voice translation robotic applications in comparison with other commercial humanoid companion robots [23]. The article showed that two-layer fuzzy multiple random forests (RF) are capable of reliably identifying emotions, which improved the efficiency of speech translation [24]. The case study combined logistic regression (LR)-optimized weightings with speech, object, and motion confidence; the paper then combined this measurement with gaze tracking and conducted studies with real-world human-robot interaction. According to the experimental findings, the suggested technique for robot-directed speech recognition performs well, with average recall and accuracy rates of 94% and 96%, respectively [25]. The present study explores and highlights various aspects of speech translation robotics that can be improved through the use of an NN-BP approach. Specific points that support the idea of improving the deployment of speech translation robotics include: 1. This study aims to enhance the effectiveness of speech translation by utilizing the NN-BP approach and assessing its impact. 2. To evaluate the effectiveness of an approach, conducting experiments and comparing performance and satisfaction is a common method used in many fields, such as machine learning, user experience, and product design. A summary of the related works is presented in Table 1.

3 Proposed methodology
Speech translation robots utilize speech recognition technology to transform spoken words into text. They then apply natural language processing algorithms to examine the text and produce a translation in the desired language.
Certain speech translation robots also utilize speech synthesis technology to produce an audio rendition of the translated speech. In this section, we describe speech translation techniques for robots in medical care, and our suggested method of a neural network based on a back-propagation algorithm is evaluated with several performance measures.

Table 1: Summary of related works
[10] Description: The HCI method has developed from traditional print media to intelligent media. A never-ending stream of VR, AR, and AI interactive devices has appeared, bringing seismic changes to people's lives through gesture control, voice control, dialogue robots, and other means. Limitation: HCI is just the beginning; more data from various sensors will be combined in the future.
[11] Description: Researchers examined the possibility of developing the procedure that Pepper, a humanoid robot, tested and evaluated. Limitation: It is very hard to predict.
[13] Description: They proposed a cloud-based voice synthesis service, which was tested in three situations: text reading, conversation, and domestic service robot (DSR) scenarios. Limitation: They focus on speech-to-sign translation without offering a vice-versa option.
[19] Description: They extracted the robot utterance's dialogue management component and combined it with the robot's motions. Limitation: It is only based on sounds, which means that the meanings of words may be lost.
[22] Description: They discussed the possible advantages of including a socially supportive robot in speech therapy interventions. Limitation: A few children's temporary dissatisfaction in a small number of training sessions.
[23] Description: Emotion analysis uses a convolutional neural network to develop robotic voice applications for translation. Limitation: Facial recognition technology was not used.
[24] Description: The two-layer fuzzy multiple random forests (RF) have increased the effectiveness of voice translation by being able to accurately recognize emotions. Limitation: The TLFMRF could use an intelligent optimization algorithm such as the Genetic Algorithm (GA) in the future.
[25] Description: The study was derived by integrating logistic regression (LR)-optimized weightings with confidence in voice, object, and motion. Limitation: It should be noted that the method's core concept

3.1 Design of the system
The software used CMU Sphinx-4, a flexible and manageable open-source Java voice recognition toolkit, to implement speech recognition. Based on the Java Speech Grammar Format, we developed a grammar-type speech framework for the voice recognizer to analyze. After correct recognition, the text is sent to a translator for processing. A sentence is then broken down into its parts by the translation algorithm; these parts include the subject, verbs, objects, and complement phrases. The algorithm then applies Chinese syntactic rules to the parsed elements and rearranges them. Finally, the DARwIn-OP speaks the Chinese version of the translated statement. Pre-recorded MP3 files were used, since there is currently no suitable Chinese TTS software available for this application. A hash table is used to find the corresponding Chinese voice recording for each word. Due to sensor performance limitations of the DARwIn-OP and the need for system stability while simultaneously running the translation and voice recognition programs, this concept was executed on a laptop computer. The humanoid robot and the computing device therefore communicate with one another through TCP/IP socket connections.
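The control flow described above can be summarized in a short sketch. The snippet below is an illustration only, not the authors' code: the function names, host address, wire format, and audio file paths are hypothetical. It assumes the laptop-side client reorders the recognized tokens and sends them to the robot over a TCP socket, where each token is mapped to a pre-recorded MP3 file through a hash table (here a Python dict).

```python
import socket

# Illustrative hash table: English token -> path of a pre-recorded
# Chinese MP3 file stored on the robot (file names are hypothetical).
AUDIO_TABLE = {
    "i": "audio/wo.mp3",
    "have": "audio/you.mp3",
    "a headache": "audio/toutong.mp3",
}

def reorder_to_chinese(subject, predicate, complement):
    """Rearrange S + P + [C|O] (English) into S + [C|O] + P (Chinese)."""
    return [subject, complement, predicate]

def send_tokens_to_robot(tokens, host="192.168.0.10", port=5005):
    """Laptop side: send the reordered tokens to the DARwIn-OP over TCP.
    Host, port, and the newline-delimited format are assumptions."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(("\n".join(tokens) + "\n").encode("utf-8"))

def robot_side_playback(tokens):
    """Robot side: look up each token in the hash table and return the
    matching MP3 files in order (the actual playback call is omitted)."""
    return [AUDIO_TABLE[t] for t in tokens if t in AUDIO_TABLE]

if __name__ == "__main__":
    ordered = reorder_to_chinese("i", "have", "a headache")
    # send_tokens_to_robot(ordered)        # laptop side (network call)
    print(robot_side_playback(ordered))    # robot side, simulated locally
```

Keeping the recognizer and translator on the laptop and only the token-to-audio lookup and playback on the robot mirrors the split described above, where the DARwIn-OP's limited computing resources are reserved for playback and motion.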
3.2 CMU Sphinx-4 speech recognizer
CMU Sphinx-4 is capable of voice recognition once a language model has been created to identify legitimate phrases. Both grammars and statistical language models are available via the CMU Sphinx-4 API for describing language. A statistical language model, the second kind, works well when there is a lot of data to work with. For now, however, we decided to use the JSGF grammar approach, as testing with a small number of example English sentences is sufficient. Sun Microsystems' JSGF is a written representation of grammar that may be used to construct grammars for either complex phrases or basic instructions. The JSGF grammar files specify the grammar of legitimate speech as follows: Subject + Predicate + Object. The voice translation algorithm fails if the input does not conform to the specified syntax. As a consequence, the voice recognition component selects only complete phrases for translation. We assumed the typical scenario of a patient seeing a doctor for routine care. The recognizer's vocabulary of target words, phrases, and sentences was developed with this use case in mind. We spoke with the Dongguk University student healthcare facility to gain an overview of the most often reported ailments and the information that physicians need to prescribe effectively. Language files for stomachache, headache, and the flu were defined based on this interview. Words and phrases from these dictionaries may be used to describe an individual's medical symptoms, such as the duration and severity of the individual's discomfort.

3.3 Algorithm of translation
This investigation focused on three of the seven patterns of English sentence construction (Subject + Predicate + Complement + Object). S + P comes before C or O in English. In contrast, a standard Chinese sentence consists of S + [C | O] + P, with P placed at the end. This variation in word order was addressed throughout the transformation process. The system's English-Chinese translation software is made up of five major modules: tokenization, part-of-speech tagging, sentence-structure grouping, Chinese grammar application, and word-by-word translation.
A. Tokenization: Once the word recognizer has finished its transcription, the tokenization module uses spacing to divide the string into tokens and, where possible, drops the articles ("a," "an," and "the"), which have no significance in Chinese. This turns every token into a real word.
B. Part-of-speech tagging: Parsing the phrases into their parts (subject, predicate, and object) requires first looking at the part of speech of each word. Each word is assigned a part of speech by looking it up in a database that categorizes words according to their meaning. Given the small set of input phrases, we focus on only one sense of a term at a time; words like "feel" are consistently identified as verbs in this system. Each word is then labeled as a noun, verb, or adjective based on this process.
C. Sentence-structure grouping: Each token, now carrying a part-of-speech label, is sorted into phrase groups. Each cluster of nouns and adverbs, for instance, produces a noun phrase. Time adverb phrases are categorized in the dictionary by using prepositions such as "since," "for," and "from." It is conventional to treat the first noun phrase in a sentence as the subject and the second as a complement or object.
A purely verbal phrase functions as the predicate.
D. Chinese grammar application: This stage entails rearranging the parts in accordance with Chinese syntactic norms. Prepositional phrases undergo a morphological and syntactic shift toward the Chinese norm. For example, prepositions such as "for" and "since" are placed at the end of a sentence in Chinese. A single Chinese word may represent several English words; thus, an expression such as "runny nose" or "sore throat" becomes one token.
E. Word-by-word translation: After the tokens have been arranged in Chinese sentence order, each English token is matched to the corresponding Chinese keyword. Since we do not use a TTS application and instead use matching pre-recorded audio files for each word, this process was streamlined by merging it with the following stage of Chinese speech generation.

3.4 Generation of Chinese speech
At this early stage, a TTS system was not used to generate Chinese speech. The DARwIn-OP receives the tokens in their new order through TCP/IP from the translation application client and searches a hash table for each token. The DARwIn-OP uses a hash table in which the "keys" are English words and the "values" are file-system links to Chinese voice recordings. Once the lookup is complete, a built-in DARwIn-OP library method is called to play the MP3 files in order. The DARwIn-OP therefore plays an MP3 file for each Chinese word, producing a whole Chinese phrase.

3.5 Design of the neural network
This study uses a 3-layer neural network, with an input layer, a hidden layer, and an output layer, as the basis of its neural network framework. The average grain size output $Z_{output}$ (MD) is calculated from the input $W_{input}$ (GR, DEN, CNL, AC), the sensitivity measures of the average rock grain size. The data take the following form:

$$W_{input} = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ \vdots & & \ddots & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{pmatrix} = \begin{pmatrix} HQ_1 & CFM_1 & DMK_1 & BD_1 \\ \vdots & \vdots & \vdots & \vdots \\ HQ_m & CFM_m & DMK_m & BD_m \end{pmatrix}^{S} \qquad (1)$$

$$Z_{output} = (z_1, z_2, \ldots, z_m)^{S} = (NC_1, NC_2, \ldots, NC_m)^{S} \qquad (2)$$

The neural network has four inputs and one output; therefore, m is the number of training observations and n is the number of input sensitivity variables per sample. The network's nonlinear mapping may be written as $Q^{m} \to Q^{1}$. If the number of hidden-layer nodes is o, the input of each hidden-layer node is

$$T_i = \sum_{j=1}^{n} x_{ji} w_j - \theta_i, \quad (i = 1, 2, \ldots, o), \qquad (3)$$

where $x_{ji}$ is the weight of the link between the input and hidden layers, and $\theta_i$ is the threshold value of hidden-layer node $i$. For the chosen activation functions, the ReLU and tanh formulae and their derivative forms are as follows:

$$\mathrm{ReLU}(w) = \begin{cases} w, & w > 0 \\ 0, & w \le 0 \end{cases} \qquad (4)$$

$$\mathrm{ReLU}'(w) = \begin{cases} 1, & w > 0 \\ 0, & w \le 0 \end{cases} \qquad (5)$$

$$\tanh(w) = \frac{e^{w} - e^{-w}}{e^{w} + e^{-w}} \qquad (6)$$

$$\tanh'(w) = 1 - \frac{(e^{w} - e^{-w})^2}{(e^{w} + e^{-w})^2} \qquad (7)$$

ReLU is used as the activation function in the input and hidden layers because its selective behaviour alleviates the problem of slow gradient descent in neural network training and makes gradient descent converge much faster than other traditional activation functions. However, because of its one-sided suppression, a ReLU neuron may no longer activate on any data once a very large gradient has passed through it and the parameters have been updated; as a result, the gradient of that neuron will remain 0 from then on.
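To make Equations (3)-(7) concrete, the following NumPy sketch is offered as an illustration only; it is not the authors' implementation, and the layer sizes, learning rate, and random data are assumptions. It builds a 4-input network with one ReLU hidden layer and a tanh output, and performs one back-propagation step on a mean-squared-error loss (the MSE loss referred to in the next paragraph).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(w):           # Eq. (4)
    return np.maximum(w, 0.0)

def relu_grad(w):      # Eq. (5)
    return (w > 0).astype(float)

def tanh_grad(w):      # Eq. (7): 1 - tanh(w)^2
    return 1.0 - np.tanh(w) ** 2

# Assumed sizes: 4 sensitivity inputs, o = 8 hidden nodes, 1 output.
n_in, n_hidden, n_out = 4, 8, 1
W1 = rng.normal(0, 0.1, (n_in, n_hidden));  b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_out)); b2 = np.zeros(n_out)

def forward(X):
    """Hidden-layer input T as in Eq. (3), then ReLU and a tanh output."""
    T = X @ W1 - b1          # threshold theta is subtracted, as in Eq. (3)
    H = relu(T)
    U = H @ W2 - b2
    return T, H, U, np.tanh(U)

def backprop_step(X, z, lr=0.01):
    """One gradient-descent update of the weights on the MSE loss."""
    global W1, b1, W2, b2
    T, H, U, z_hat = forward(X)
    m = X.shape[0]
    d_out = (z_hat - z) * tanh_grad(U) / m   # dL/dU (constant factor folded into lr)
    d_hid = (d_out @ W2.T) * relu_grad(T)    # back through the ReLU layer
    W2 -= lr * H.T @ d_out
    W1 -= lr * X.T @ d_hid
    # theta enters Eq. (3) with a minus sign, hence the opposite update direction
    b2 += lr * d_out.sum(axis=0)
    b1 += lr * d_hid.sum(axis=0)
    return float(np.mean((z_hat - z) ** 2))

# Toy data standing in for the m training observations (not real well-log data).
X = rng.normal(size=(32, n_in))
z = np.tanh(X.sum(axis=1, keepdims=True))
print(backprop_step(X, z))
```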
If the learning rate is too high, the majority of the network's neurons will probably become inactive, and the tanh activation function only responds when the features change noticeably. The final output layer therefore uses the tanh activation function, since the feature effect is continually propagated throughout the cycle.
Figure 2: Structure of the neural network
Based on the average particle size and the sensitivity variables of low-permeability sandstone, both shallow and deep neural networks were constructed; the deep neural network has two additional hidden layers compared to the shallow one. The activation function of the input and hidden layers of both networks is ReLU; the input is the sensitivity response of the average grain size, and the output is the median grain size. The loss function used for the regression problem is the MSE loss. As a result, two neural networks were constructed and then trained on the input data to form the median grain-size prediction model for low-permeability sandstone. Figure 2 depicts the general framework of the neural network.

4 Results and discussion
The experimental outcomes are determined by two metrics: the proportion of correct translations at the initial and final stages of the process. The success rates for the three primary processes, English speech-to-text, word reordering according to Chinese sentence structure, and English text to Chinese speech, are displayed in Figure 3, and the values are listed in Table 2.
Figure 3: Success rate at each breakpoint
Table 2: Success rate at each breakpoint
Breakpoint | Success rate
English speech-to-text | 70%
Word rearrangement into Chinese sentence structure | 100%
English text rendered as Chinese voice | 100%

4.1 English language to text
The efficiency of CMU Sphinx-4 and the translation model determines this rate. The success rate for English speakers, both native and non-native, is 70% on average. The phonemic similarity between "have" and "had" is the fundamental reason why the rate decreased; other phrases were transcribed almost exactly.

4.2 Chinese sentence structure-based word rearrangement
If CMU Sphinx-4 is able to identify a spoken sentence as described in the grammar files, the system is able to transform all of the example sentences provided in those files. We only accepted Subject + Predicate + [Complement | Object] phrases as legitimate input, which made this possible.

4.3 English words to the Chinese language
Assuming that the word rearrangement was correct, the generation of Chinese speech from English words worked flawlessly, as each restructured English word was matched to its pre-saved Chinese word file. The percentage of accurate S2S translations is displayed in Figure 4 and Table 3. As can be seen, this rate is affected by how well speech is recognized. CMU Sphinx-4's acoustic model seems to be more accommodating to native English speakers, since the success rate was 20 percentage points higher for native than for non-native speakers of English. Chinese translations that are accurate in terms of grammar and meaning can still contain awkward expressions. This restriction arose because the English elements were rearranged into Chinese word order and then translated word by word.
Figure 4: Success level of S2S translation
Table 3: Comparison of S2S translation success levels (%)
Method | Native English speaker | Non-native English speaker
DNN | 80 | 61
RF | 83 | 63
LR | 85 | 65
NN-BP [Proposed] | 89 | 69

4.4 Accuracy
Accuracy is a performance indicator that assesses how correct the system's predictions are overall. In our BP algorithm-based platform, accuracy refers to the system's capacity to correctly recognize users' requirements and preferences and to provide them with the right kind of feedback and direction. The high accuracy values obtained by our technique show that it is capable of providing accurate and useful predictions. Our research indicates that the accuracy of our BP algorithm platform is good enough to allow real-world applications in voice translation robots, and that it can be further enhanced by including more representative and varied data, as well as by refining the rule set and feature selection.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)$$

Figure 5: Accuracy of the proposed and existing methods
Figure 5 displays the accuracy of the proposed and existing methods. Accuracy is usually expressed as a percentage of the total. Both the existing methods and the proposed one can produce erroneous predictions, and this risk is acknowledged for all systems. In comparison with DNN's accuracy of 66%, RF's accuracy of 83%, and LR's accuracy of 77%, the proposed technique, NN-BP, achieves an accuracy of 95%. The suggested strategy therefore has the best accuracy rate. The accuracy of the proposed approach is shown in Table 4.
Table 4: Comparison of accuracy
Method | Accuracy (%)
DNN | 66
RF | 83
LR | 77
NN-BP [Proposed] | 95

4.5 Precision
Relevant factors also include the algorithm's processing speed, resource efficiency, and capacity to deal with various accents, dialects, and languages. When evaluating the precision of a neural network-based back-propagation (BP) algorithm in a voice translation robot, we would normally use metrics that assess the correctness of the system's transcription of spoken language. Overall, the precision of a neural network-based BP algorithm in a voice translation robot is influenced by several variables, such as the amount and quality of training data, the complexity of the model architecture, and the efficiency of the learning and optimization methods used.

$$Precision = \frac{TP}{TP + FP} \qquad (9)$$

Figure 6: Precision of the proposed and existing methods
The precision of the proposed and existing approaches is shown in Figure 6. Precision is usually expressed as a percentage of the total. Both the existing approaches and the proposed one can produce inaccurate estimates, and this hazard is acknowledged for all systems. The suggested technique, however, has a 90% precision rate, compared with 63% for DNN, 77% for RF, and 81% for LR. Consequently, the proposed technique has the greatest precision rate. Table 5 displays the precision of the proposed procedure.
Table 5: Comparison of precision
Method | Precision (%)
DNN | 63
RF | 77
LR | 81
NN-BP [Proposed] | 90

4.6 Recall
Recall is a metric used in machine learning and information retrieval to assess how complete the results are.
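As a small worked example of Equations (8)-(10), the sketch below computes accuracy, precision, and recall from confusion-matrix counts; the counts shown are invented for illustration and are not results from this study.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall as defined in Eqs. (8)-(10)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (8)
    precision = tp / (tp + fp)                   # Eq. (9)
    recall = tp / (tp + fn)                      # Eq. (10)
    return accuracy, precision, recall

# Hypothetical counts for one evaluation run (illustrative only).
acc, prec, rec = confusion_metrics(tp=95, tn=90, fp=10, fn=5)
print(f"accuracy={acc:.2f}, precision={prec:.2f}, recall={rec:.2f}")
```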
Recall measures the proportion of relevant items that were retrieved from a larger collection. For a neural network-based back-propagation (BP) algorithm in a voice translation robot, recall is typically assessed as the proportion of correct words or characters that are properly identified and included in the output transcription.

$$Recall = \frac{TP}{TP + FN} \qquad (10)$$

Figure 7: Recall of the proposed and existing methods
Figure 7 depicts the recall of the proposed and existing methods. Recall is usually expressed as a percentage of the total. Both the existing methods and the proposed one can produce erroneous predictions, and this risk is acknowledged for all systems. The recommended method, NN-BP, produces a recall rate of 97%, as opposed to recall rates of 63% for DNN, 77% for RF, and 88% for LR. The suggested strategy therefore has the best recall rate. The recall of the proposed procedure is shown in Table 6.
Table 6: Comparison of recall
Method | Recall (%)
DNN | 63
RF | 77
LR | 88
NN-BP [Proposed] | 97

4.7 Discussion
The research was derived by integrating logistic regression (LR)-optimized weightings with confidence in voice, object, and motion [10]. Artificial neural networks or regression algorithms can be used to estimate linear predictor coefficients as well as other critical parameters for quantized construction [11]. Based on GANs trained with converted audio and normalized key points retrieved from a chosen dataset, the system was tested with people who identified with different cultures, and they provided favorable comments on its capacity to embed cultural features [19]. The power analysis demonstrates that the inquiry is limited to preliminary exploratory results because of the small number of subjects [22]. The average emotion recognition accuracy is 77.82%; the accuracy increases further and can reach 79.81% on average if the speech data are additionally augmented to increase the total amount of data [23]. The computation times for the TLFMRF, RF, and BPNN are 0.0579 s, 0.0128 s, and 0.0013 s, respectively. Although the suggested technique has the longest computing time, it is still under a second, which ensures that real-time tracking accuracy remains within acceptable bounds [24]. The approach is highly efficient and fulfills a crucial need for natural and secure human-robot interaction [25]. The success rate of sentence-to-sentence translation is 89% when performed by a native speaker of English but drops to 69% when performed by someone whose first language is not English.

5 Conclusion
Patients who speak languages other than English may benefit greatly from the use of voice translation robots in hospitals. By offering precise and immediate interpretations of spoken language, these robots may help healthcare practitioners communicate more effectively with their patients. In medical environments where communication is essential, such as emergency rooms, intensive care units, and clinics, speech translation robots may be very helpful.
These robots may assist in enhancing the standard of care and guaranteeing that patients receive the right therapy by accurately translating patients' symptoms and medical histories for healthcare professionals. We have created a first humanoid application model that can be used for English-Chinese S2S translation in a medical setting, to meet the needs of the ever-increasing number of patients from outside China who need translation services. DARwIn-OP, a humanoid robot, serves as the base because of the increased mobility and versatility that this kind of robot brings to the workplace. Furthermore, it can communicate with anybody, including people who are unfamiliar with technological devices. As a result, it is anticipated that the system suggested in this work can significantly increase its capability. Speech translation software and a rule-based machine translation approach were brought together to create this system. The software first employs a narrow-domain model of CMU Sphinx-4 for voice recognition and then uses rule-based translation techniques to convert the detected English text into Chinese. S2S translation has an 89% success rate for native speakers of English but only a 69% success rate for non-native speakers of English. Currently, the translation works flawlessly with a fixed voice recognition domain and predefined rules in the translation techniques. The outcome therefore depends on the accuracy of speech recognition.

References
[1] Tan, Y., (2022). Design of Intelligent Speech Translation System Based on Deep Learning. Mobile Information Systems.
[2] Mathur, M., Samiulla, S., Bhat, V. and Jenitta, J., (2020), October. Design and Development of Writing Robot Using Speech Processing. In 2020 IEEE Bangalore Humanitarian Technology Conference (B-HTC) (pp. 1-4). IEEE.
[3] Shahin, I., Hindawi, N., Nassif, A.B., Alhudhaif, A. and Polat, K., (2022). Novel dual-channel long short-term memory compressed capsule networks for emotion recognition. Expert Systems with Applications, 188, p.116080.
[4] Wang, Y., (2022). The Performance of Artificial Intelligence Translation App in Japanese Language Education Guided by Deep Learning. Computational Intelligence and Neuroscience.
[5] Yuvaraj, S., Badholia, A., William, P., Vengatesan, K. and Bibave, R., (2022), May. Speech Recognition Based Robotic Arm Writing. In Proceedings of International Conference on Communication and Artificial Intelligence: ICCAI 2021 (pp. 23-33). Singapore: Springer Nature Singapore.
[6] Shah, H.D., Sundas, A. and Sharma, S., (2021), September. Controlling Email System Using Audio with Speech Recognition and Text to Speech. In 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO) (pp. 1-7). IEEE.
[7] Kargathara, A., Vaidya, K. and Kumbharana, C.K., (2021). Analyzing desktop and mobile applications for text-to-speech conversation. In Rising Threats in Expert Applications and Solutions: Proceedings of FICR-TEAS 2020 (pp. 331-337). Springer Singapore.
[8] Walker, N.T., Ultes, S. and Lison, P., (2022). GraphWOZ: Dialogue Management with Conversational Knowledge Graphs. arXiv preprint arXiv:2211.12852.
[9] Yadav, S.P., Zaidi, S., Mishra, A. and Yadav, V., (2022). Survey on machine learning in speech emotion recognition and vision systems using a recurrent neural network (RNN). Archives of Computational Methods in Engineering, 29(3), pp.1753-1770.
[10] Lv, Z., Poiesi, F., Dong, Q., Lloret, J. and Song, H., (2022). Deep Learning for Intelligent Human-Computer Interaction. Applied Sciences, 12(22), p.11457.
[11] Delić, V., Perić, Z., Sečujski, M., Jakovljević, N., Nikolić, J., Mišković, D., Simić, N., Suzić, S. and Delić, T., (2019). Speech technology progress based on new machine learning paradigm. Computational Intelligence and Neuroscience.
[12] Raheem, A.K.A. and Zuhair, M., (2023), March. Real-time speech recognition of Arabic language. In AIP Conference Proceedings (Vol. 2591, No. 1, p. 030018). AIP Publishing LLC.
[13] Homburg, D., Thieme, M.S., Völker, J. and Stock, R., (2019). Robotalk - prototyping a humanoid robot as a speech-to-sign language translator.
[14] Sun, H., Gao, X., Guo, L.Y., Tao, L.Q., Guo, Z.H., Shao, Y., Cui, T., Yang, Y., Pu, X. and Ren, T.L., (2023). Graphene-based dual-function acoustic transducers for machine learning-assisted human-robot interfaces. InfoMat, 5(2), p.e12385.
[15] Gkeka, E., Agorastou, E. and Drigas, A., (2019). Artificial Techniques for Language Disorders. Int. J. Recent Contributions Eng. Sci. IT, 7(4), pp.68-76.
[16] Chen, X., (2021). Simulation of English speech emotion recognition based on transfer learning and CNN neural network. Journal of Intelligent & Fuzzy Systems, 40(2), pp.2349-2360.
[17] Fujii, A. and Kristiina, J., (2022), March. Open source system integration towards natural interaction with robots. In 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (pp. 768-772). IEEE. Minato, T., Higashinaka, R., Sakai, K., Funayama, T., Nishizaki, H. and Nagai, T., (2022). Overview of dialogue robot competition 2022. arXiv preprint arXiv:2210.12863.
[18] Gjaci, A., Recchiuto, C.T. and Sgorbissa, A., (2022). Towards Culture-Aware Co-Speech Gestures for Social Robots. International Journal of Social Robotics, 14(6), pp.1493-1506.
[19] Laban, G., Le Maguer, S., Lee, M., Kontogiorgos, D., Reig, S., Torre, I., Tejwani, R., Dennis, M.J. and Pereira, A., (2022), March. Robo-identity: Exploring artificial identity and emotion via speech interactions. In 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (pp. 1265-1268). IEEE.
[20] Guljajeva, V. and Canet Sola, M., (2022), October. Dream Painter: An Interactive Art Installation Bridging Audience Interaction, Robotics, and Creative AI. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 7235-7236).
[21] Esfandbod, A., Rokhi, Z., Meghdari, A.F., Taheri, A., Alemi, M. and Karimi, M., (2023). Utilizing Lee, M.C., Chiang, S.Y., Yeh, S.C. and Wen, T.F., (2020). Study on emotion recognition and companion Chatbot using deep neural network. Multimedia Tools and Applications, 79, pp.19629-19657.
[22] Chen, L., Su, W., Feng, Y., Wu, M., She, J. and Hirota, K., (2020). Two-layer fuzzy multiple random forest for speech emotion recognition in human-robot interaction. Information Sciences, 509, pp.150-163.
[23] Zuo, X., Iwahashi, N., Taguchi, R., Funakoshi, K., Nakano, M., Matsuda, S., Sugiura, K. and Oka, N., (2010), September. Detecting robot-directed speech by situated understanding in object manipulation tasks. In 19th International Symposium in Robot and Human Interactive Communication (pp. 608-613). IEEE.